This repository has been archived by the owner on May 23, 2023. It is now read-only.

A database for storing, querying and doing stats on credential leaks

EC-DIGIT-CSIRC/credentialLeakDB

credentialleakDB

A database structure to store leaked credentials.

Think: our own, internal HaveIBeenPwned database.

Why?

  1. To quickly find duplicates before passing the data on for further processing
  2. To have a way to load diverse credential breaches into a common structure and run common queries on it
  3. To quickly generate statistics on credential leaks
  4. To have a well-defined interface for passing data on to other automation steps

Documentation

Installation

Docker

Via pip and venv

git clone https://github.com/EC-DIGIT-CSIRC/credentialLeakDB.git
cd credentialLeakDB
# create a virtualenv
virtualenv --python=python3.7 venv
source venv/bin/activate
pip install -r requirements.txt

Next, make sure the following files exist:

  • VIPs.txt ... a newline-separated list of email addresses that you consider VIPs.
  • api/config.py ... see below
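
As an illustration of the VIPs.txt format, here is a minimal sketch of how such a newline-separated list could be read. The helper names `load_vips` and `is_vip` are hypothetical, and the case-insensitive comparison is an assumption; the project's own loading code is not shown in this README.

```python
from pathlib import Path


def load_vips(path="VIPs.txt"):
    """Return the set of VIP email addresses listed one per line in `path`."""
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    # Skip blank lines; lower-casing is an assumption so that lookups ignore case.
    return {line.strip().lower() for line in lines if line.strip()}


def is_vip(email, vips):
    """Check an address against the loaded VIP set, ignoring case."""
    return email.strip().lower() in vips
```

Usage would be along the lines of `is_vip(row_email, load_vips())` when deciding whether a leaked address needs special handling.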

Database structure

Search in Confluence for "credentialLeakDB" in the Automation space.

SQL structure: db.sql

The EER diagram was intentionally simplified a lot. If we end up storing billions of repeated text records, we can go back to more normalization. For now, however, this seems to be enough.

EER Diagram

Meaning of the fields

Table leak

 Column            | Type                     | Collation | Nullable | Description
-------------------+--------------------------+-----------+----------+-------------
 id                | integer                  |           | not null | Primary key. Auto-generated.
 breach_ts         | timestamp with time zone |           |          | If known, the timestamp when the breach happened.
 source_publish_ts | timestamp with time zone |           |          | The timestamp at which the source (e.g. Spycloud) published the data.
 ingestion_ts      | timestamp with time zone |           | not null | The timestamp when we ingested the data.
 summary           | text                     |           | not null | A short summary (slug) of the leak. Used for display purposes.
 ticket_id         | text                     |           |          |
 reporter_name     | text                     |           |          | The name of the reporter who sent us the notification, e.g. CERT-EU, Spycloud.
 source_name       | text                     |           |          | The name of the source where this leak came from. Either the name of a collection or some other name.
Indexes:
    "leak_pkey" PRIMARY KEY, btree (id)
Referenced by:
    TABLE "leak_data" CONSTRAINT "leak_data_leak_id_fkey" FOREIGN KEY (leak_id) REFERENCES leak(id)

Table leak_data

 Column               | Type    | Collation | Nullable | Description
----------------------+---------+-----------+----------+-------------
 id                   | integer |           | not null | Primary key. Auto-generated.
 leak_id              | integer |           | not null | References leak(id).
 email                | text    |           | not null | The email address associated with the leak.
 password             | text    |           | not null | Either the encrypted or the unencrypted password. If the unencrypted password is available, it is what goes into this field.
 password_plain       | text    |           |          | The plaintext password, if known.
 password_hashed      | text    |           |          | The hashed password, if known.
 hash_algo            | text    |           |          | The hashing algorithm, if we can determine it and the password_hashed field is set. For example "md5" or "sha1".
 ticket_id            | text    |           |          | References the ticket system's ticket ID associated with handling this credential leak. This ticket could contain information on how we contacted the affected user.
 email_verified       | boolean |           |          | Whether the email address was verified to exist and be active.
 password_verified_ok | boolean |           |          | Was that password still valid / active?
 ip                   | inet    |           |          | IP address of the client PC in case of a password stealer.
 domain               | text    |           |          | Domain of the user's email address.
 browser              | text    |           |          | If the password was leaked via a password stealer malware, then the browser of the user goes here. Otherwise empty.
 malware_name         | text    |           |          | If the password was leaked via a password stealer malware, then the malware name goes here. Otherwise empty.
 infected_machine     | text    |           |          | If the password was leaked via a password stealer malware, then the infected (Windows) PC name (some ID for the machine) goes here.
 dg                   | text    |           | not null | The affected DG (in other organisations, this would be called "department").
 count_seen           | integer |           |          | How often we have already seen this unique combination (leak, email, password, domain), i.e. a duplicate counter.
Indexes:
    "leak_data_pkey" PRIMARY KEY, btree (id)
    "constr_unique_leak_data_leak_id_email_password_domain" UNIQUE CONSTRAINT, btree (leak_id, email, password, domain)
    "idx_leak_data_unique_leak_id_email_password_domain" UNIQUE, btree (leak_id, email, password, domain)
    "idx_leak_data_dg" btree (dg)
    "idx_leak_data_email" btree (upper(email))
    "idx_leak_data_email_password_machine" btree (email, password, infected_machine)
    "idx_leak_data_malware_name" btree (malware_name)
Foreign-key constraints:
    "leak_data_leak_id_fkey" FOREIGN KEY (leak_id) REFERENCES leak(id)
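
To make the role of count_seen and the unique key (leak_id, email, password, domain) more concrete, here is a small in-memory sketch of a duplicate counter over incoming rows. It only illustrates the idea; how the application itself maintains count_seen in the database is not shown in this README, and the function name `count_duplicates` is hypothetical.

```python
from collections import Counter

# The columns covered by the unique constraint on leak_data.
UNIQUE_KEY = ("leak_id", "email", "password", "domain")


def count_duplicates(rows):
    """Count how often each (leak_id, email, password, domain) combination occurs."""
    seen = Counter()
    for row in rows:
        # Build the key tuple in the same column order as the unique constraint.
        seen[tuple(row[col] for col in UNIQUE_KEY)] += 1
    return seen
```

Each key in the returned counter would correspond to one leak_data row, and its count to the value one would expect in count_seen.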

Usage of the API

To explore the API endpoints, start the server (follow the installation instructions below) and go to $servername/docs, where $servername is the domain or IP address you installed it under. The docs/ endpoint hosts the Swagger / OpenAPI 3 documentation.

GET parameters

These are pretty self-explanatory thanks to the swagger UI.

POST and PUT

For HTTP POST (a.k.a INSERT into DB) you will need to provide the following JSON info:

leak object

{
  "id": 0,
  "ticket_id": "string",
  "summary": "string",
  "reporter_name": "string",
  "source_name": "string",
  "breach_ts": "2021-03-29T12:21:56.370Z",
  "source_publish_ts": "2021-03-29T12:21:56.370Z"
}

The id field only needs to be filled out when PUTing data (a.k.a. UPDATE statement). The id is the internal, automatically generated primary key and will be assigned for you, so please leave it out when POSTing a new leak row via the HTTP POST /leak endpoint. The answer will be a JSON object whose data array contains a dict with the id, such as:

{
  "meta": {
    "version": "0.5",
    "duration": 0.006,
    "count": 1
  },
  "data": [
    {
      "id": 18
    }
  ],
  "error": null
}

Meaning: the version of the API was 0.5, the query took 0.006 sec (6 millisec), and one result was returned. The data array contains one element with id=18, i.e. the ID of the inserted leak object was 18. You can now reference this ID when inserting leak_data objects.
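
A minimal sketch of how a client could pull the new ID out of such a response. It relies only on the response shape shown above; the function name `inserted_ids` is hypothetical, and treating a non-null error field as a failure is an assumption.

```python
import json


def inserted_ids(response_text):
    """Extract the ids from the `data` array of an API response body."""
    body = json.loads(response_text)
    # Assumption: a non-null "error" field signals that the request failed.
    if body.get("error") is not None:
        raise RuntimeError(f"API reported an error: {body['error']}")
    return [item["id"] for item in body["data"]]


example = (
    '{"meta": {"version": "0.5", "duration": 0.006, "count": 1},'
    ' "data": [{"id": 18}], "error": null}'
)
print(inserted_ids(example))  # prints [18]
```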

leak_data object

As with the leak object, the id field only needs to be filled out when PUTing data (a.k.a. UPDATE statement). Otherwise please leave it out when POSTing a new leak_data row. Note well: the leak_id field needs to be filled out in this case. You first have to create a leak object and then, afterwards, the leak_data object.

{
  "id": 0,
  "leak_id": 0,
  "email": "user@example.com",
  "password": "string",
  "password_plain": "string",
  "password_hashed": "string",
  "hash_algo": "string",
  "ticket_id": "string",
  "email_verified": true,
  "password_verified_ok": true,
  "ip": "string",
  "domain": "string",
  "browser": "string",
  "malware_name": "string",
  "infected_machine": "string",
  "dg": "string"
}
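
The two-step order (first the leak, then the leak_data rows that reference it) could be sketched as follows with the standard library. The requests are only built, not sent. The base URL, the `/leak_data` path, and the helper name `build_post` are assumptions for illustration; check $servername/docs for the actual paths. Authentication (API keys) is left out.

```python
import json
import urllib.request


def build_post(base_url, path, payload):
    """Build (but do not send) a JSON POST request; `id` is dropped for inserts."""
    body = {k: v for k, v in payload.items() if k != "id"}
    return urllib.request.Request(
        url=base_url.rstrip("/") + path,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


BASE = "http://localhost:8080"  # assumption: server running locally with PORT=8080

# Step 1: create the leak object.
leak_req = build_post(BASE, "/leak", {"summary": "example leak", "reporter_name": "Spycloud"})

# Step 2: reference the returned id in leak_data (18 is a stand-in for the real answer).
leak_id = 18
data_req = build_post(
    BASE,
    "/leak_data",  # assumption: path of the leak_data endpoint
    {"leak_id": leak_id, "email": "user@example.com", "password": "secret", "dg": "DIGIT"},
)
# urllib.request.urlopen(leak_req) would actually send the request.
```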

import/csv/ endpoint

Also pretty self-explanatory. You first need to create a leak object, pass its ID as a GET-style parameter, and upload the CSV in Spycloud format via the form.

Installation

  1. Install git and check out this repository:
apt install git
git clone ...
cd credentialLeakDB
  2. Install PostgreSQL:
# in Ubuntu:
apt install postgresql-12
# alternatively, if you are on Debian 10, you can also use postgresql-11, both work:
# apt install postgresql-11
  3. As user postgres, create the database and the database user:
sudo su - postgres
createdb credentialleakdb
createuser credentialleakdb
psql -c "ALTER ROLE credentialleakdb WITH PASSWORD '<insert some random password here>'" template1
  4. Create the DB schema:
psql -U credentialleakdb credentialleakdb < db.sql
  5. Set the env vars:
export PORT=8080
export DBNAME=credentialleakdb
export DBUSER=credentialleakdb
export DBPASSWORD=... <insert the password you gave the user> ...
export DBHOST=localhost
  6. Create a virtual environment if it does not exist yet:
virtualenv --python=python3.7 venv
source venv/bin/activate
pip install -r requirements.txt
  7. Start the program from the main directory:
export PYTHONPATH=$(pwd); uvicorn --reload --host 0.0.0.0 --port $PORT api.main:app
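
Before starting the server, it can help to check that the environment variables from the steps above are actually set. The following is a sketch only: how the application itself reads these variables is not shown in this README, and the helper names `db_settings` and `missing_vars` as well as the libpq-style keyword names are assumptions.

```python
import os


def db_settings(env=os.environ):
    """Collect the database settings from the DB* environment variables."""
    return {
        "dbname": env["DBNAME"],
        "user": env["DBUSER"],
        "password": env["DBPASSWORD"],
        # Assumption: fall back to localhost when DBHOST is not set.
        "host": env.get("DBHOST", "localhost"),
    }


def missing_vars(env=os.environ):
    """List the expected variables that are not set, for an early sanity check."""
    required = ("PORT", "DBNAME", "DBUSER", "DBPASSWORD", "DBHOST")
    return [name for name in required if name not in env]
```

Calling `missing_vars()` in the shell session that will run uvicorn should return an empty list once all exports are in place.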

Configuration

Please copy the file config.SAMPLE.py to api/config.py and adjust accordingly. Here you can set API keys etc.