Dataset Marketplace

Dataset Access

Accessing the datasets requires an active AWS account with credentials issued by Trustfull. These credentials include an AWS Access Key and Secret Key, which will be provided separately via email and SMS.

These credentials are required to authenticate with AWS S3 and download the datasets. To access the datasets, you need:

AWS CLI installed

The AWS Command Line Interface (CLI) must be installed to interact with S3. Once installed, it must be configured with the provided credentials and, optionally, a default region.

Download and install AWS CLI by following the AWS CLI Installation Guide, then verify the installation by running:

aws --version

AWS Credentials Configuration

Once you have received your credentials, configure AWS CLI:

aws configure

You will be prompted to enter:

  • AWS Access Key ID: (Received via email)
  • AWS Secret Access Key: (Received via SMS)
  • Default region name: (Leave blank or set your preferred AWS region)
  • Default output format: (Leave blank or set json)
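As a quick sanity check after configuration, the sketch below inspects the contents of the shared credentials file that aws configure writes (by default ~/.aws/credentials) and reports whether a profile defines both keys. The helper name profile_is_complete is an illustrative assumption, not part of any AWS tooling, and the sample values are placeholders.

```python
import configparser
from pathlib import Path

# Default location of the shared credentials file written by `aws configure`.
DEFAULT_CREDENTIALS_PATH = Path.home() / ".aws" / "credentials"

REQUIRED_KEYS = ("aws_access_key_id", "aws_secret_access_key")


def profile_is_complete(credentials_text: str, profile: str = "default") -> bool:
    """Return True if the profile defines both required keys with non-empty values."""
    # Interpolation is disabled so that a '%' inside a key is read literally.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(credentials_text)
    if not parser.has_section(profile):
        return False
    return all(parser.get(profile, key, fallback="").strip() for key in REQUIRED_KEYS)


# Demonstration on an in-memory sample (placeholder values, not real credentials).
sample = """
[default]
aws_access_key_id = your-access-key
aws_secret_access_key = your-secret-key
"""
print(profile_is_complete(sample))
```

To check the real file, pass DEFAULT_CREDENTIALS_PATH.read_text() instead of the sample string.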

Accessing the Datasets on S3

Each dataset is stored in a specific prefix inside the S3 bucket. This prefix corresponds to the customer ID assigned to you. For example, if your customer ID is 123456, your datasets will be stored in:

s3://your-dataset-bucket/123456/
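The mapping from customer ID to S3 location can be sketched as a pair of small helpers. The function names dataset_prefix and dataset_uri are hypothetical, and the bucket name is the same placeholder used throughout this page.

```python
def dataset_prefix(customer_id: str) -> str:
    """Return the S3 key prefix for a customer, always ending with one slash."""
    return f"{customer_id.strip('/')}/"


def dataset_uri(bucket: str, customer_id: str, file_name: str = "") -> str:
    """Build the s3:// URI for a customer's prefix or for one file inside it."""
    return f"s3://{bucket}/{dataset_prefix(customer_id)}{file_name}"


print(dataset_uri("your-dataset-bucket", "123456"))
# s3://your-dataset-bucket/123456/
print(dataset_uri("your-dataset-bucket", "123456", "dataset-file.csv"))
# s3://your-dataset-bucket/123456/dataset-file.csv
```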

Accessing Datasets from Unix or Windows (PowerShell)

Once AWS CLI is configured, you can list the datasets under your prefix (ls command):

$ aws s3 ls s3://your-dataset-bucket/123456/

To download a specific dataset (cp command):

$ aws s3 cp s3://your-dataset-bucket/123456/dataset-file.csv .

To download all datasets under your customer ID (sync command):

$ aws s3 sync s3://your-dataset-bucket/123456/ .

Accessing Datasets with Python

You can also access the datasets using the boto3 library in Python.

pip install boto3

Example: Listing and Downloading Files

import boto3

# Configure AWS access
aws_access_key = "your-access-key"
aws_secret_key = "your-secret-key"
bucket_name = "your-dataset-bucket"
customer_id = "123456"  # Your assigned customer ID

# Initialize S3 client
s3_client = boto3.client(
    "s3",
    aws_access_key_id=aws_access_key,
    aws_secret_access_key=aws_secret_key
)

# List files in the S3 bucket under the customer's prefix
prefix = f"{customer_id}/"
response = s3_client.list_objects_v2(Bucket=bucket_name, Prefix=prefix)

if "Contents" in response:
    print("Available datasets:")
    for obj in response["Contents"]:
        print(obj["Key"])

    # Download a specific file
    file_name = "dataset-file.csv"
    s3_client.download_file(bucket_name, f"{prefix}{file_name}", file_name)
    print(f"Downloaded {file_name}")

else:
    print("No files found in the bucket for this customer ID.")

You can also visit our recipes section for an additional tutorial on how to access the datasets using Python.
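A single list_objects_v2 call returns at most 1,000 keys, so prefixes holding many files need pagination. The sketch below is a rough Python counterpart of the sync command that follows every page and downloads each object. The helper names iter_dataset_keys and download_all are illustrative assumptions; the s3_client argument is expected to be a boto3 S3 client such as the one created in the example above.

```python
from pathlib import Path


def iter_dataset_keys(s3_client, bucket_name: str, prefix: str):
    """Yield every object key under the prefix, following pagination."""
    paginator = s3_client.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket_name, Prefix=prefix):
        for obj in page.get("Contents", []):
            # Skip "folder" placeholder keys that end with a slash.
            if not obj["Key"].endswith("/"):
                yield obj["Key"]


def download_all(s3_client, bucket_name: str, prefix: str, target_dir: str = ".") -> list:
    """Download every object under the prefix into target_dir; return the local paths."""
    downloaded = []
    for key in iter_dataset_keys(s3_client, bucket_name, prefix):
        # Keep the path relative to the customer prefix, e.g. "dataset-file.csv".
        local_path = Path(target_dir) / key[len(prefix):]
        local_path.parent.mkdir(parents=True, exist_ok=True)
        s3_client.download_file(bucket_name, key, str(local_path))
        downloaded.append(str(local_path))
    return downloaded
```

With the variables from the example above, download_all(s3_client, bucket_name, prefix) would fetch every file under the customer prefix into the current directory. Unlike aws s3 sync, this sketch always downloads every file again rather than skipping unchanged ones.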

Dataset Format and Versioning

The datasets are provided in CSV (Comma-Separated Values) format and each new update replaces the previous version of the same file. However, S3 versioning allows retrieving previous versions of the dataset if needed.

Retrieving Previous Versions of a Dataset:

aws s3api list-object-versions --bucket your-dataset-bucket --prefix 123456/dataset.csv

Example response:

{
    "Versions": [
        {
            "Key": "123456/dataset.csv",
            "VersionId": "5FG8XaTzD_Example2",
            "LastModified": "2024-04-01T10:00:00.000Z",
            "IsLatest": true
        },
        {
            "Key": "123456/dataset.csv",
            "VersionId": "yd9E3xaM8lF_Example1",
            "LastModified": "2024-03-01T10:00:00.000Z",
            "IsLatest": false
        }
    ]
}
  • The VersionId field identifies a specific version of the dataset.
  • The IsLatest: true entry corresponds to the current dataset version.
  • Older versions are listed with their LastModified timestamps.

Download a Specific Version of the Dataset

To retrieve a specific version of the dataset, use the VersionId from the previous command:

aws s3api get-object --bucket your-dataset-bucket --key 123456/dataset.csv --version-id yd9E3xaM8lF_Example1 dataset_old.csv

This command downloads an older version of dataset.csv and saves it as dataset_old.csv.
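The same two steps can be sketched in Python. The helper names previous_versions and download_version are hypothetical; the boto3 calls they rely on are list_object_versions (whose response is passed to the first helper) and download_file with a VersionId entry in ExtraArgs.

```python
def previous_versions(response: dict) -> list:
    """Return the non-current versions from a list_object_versions response, newest first."""
    older = [v for v in response.get("Versions", []) if not v.get("IsLatest")]
    return sorted(older, key=lambda v: v["LastModified"], reverse=True)


def download_version(s3_client, bucket_name: str, key: str, version_id: str, target: str) -> None:
    """Download one specific version of an object to a local file."""
    s3_client.download_file(
        bucket_name, key, target, ExtraArgs={"VersionId": version_id}
    )
```

A typical use would be to call s3_client.list_object_versions(Bucket=bucket_name, Prefix="123456/dataset.csv"), pass the response to previous_versions, and hand the chosen VersionId to download_version.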

Managing Updates

Trustfull updates datasets on a regular basis, typically on a monthly schedule, unless otherwise specified in the contract. Customers can monitor for new updates by implementing a polling mechanism that checks for the latest dataset version in the S3 bucket.

Polling involves listing the files within the designated S3 bucket prefix and comparing timestamps to determine whether a newer version is available. If a contract specifies a different update frequency, the polling interval should be adjusted accordingly to match the expected dataset refresh rate.

For customers using automated data pipelines, programmatic access via SDKs such as Boto3 (Python) allows for efficient retrieval and integration of updated datasets. Below is a Python example of how to poll for the latest dataset update using Boto3:

import boto3
from datetime import timezone

# AWS S3 Configuration
bucket_name = "trustfull-datasets"
customer_id = "123456"
dataset_prefix = f"{customer_id}/"

# Initialize S3 Client
s3_client = boto3.client("s3")

# List the objects under the customer's prefix and pick the most recent one
response = s3_client.list_objects_v2(Bucket=bucket_name, Prefix=dataset_prefix)

if "Contents" in response:
    latest_file = max(response["Contents"], key=lambda x: x["LastModified"])
    last_modified = latest_file["LastModified"].astimezone(timezone.utc)
    print(f"Latest dataset: {latest_file['Key']} updated at {last_modified}")
else:
    print("No datasets available.")

This script lists the datasets in the assigned S3 prefix, identifies the most recently updated file, and prints its timestamp. Customers can schedule it to run at whatever interval best suits their needs in order to detect new dataset updates automatically.
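To turn the timestamp comparison into an actual change detector, a scheduled job needs to remember what it saw on the previous run. The following is a minimal sketch under that assumption: the helper names and the JSON state-file layout are illustrative choices, not part of any Trustfull or AWS tooling. It relies on the LastModified values returned by boto3 being timezone-aware datetime objects.

```python
import json
from datetime import datetime, timezone
from pathlib import Path


def find_updates(objects: list, last_seen: datetime) -> list:
    """Return the objects modified strictly after last_seen, oldest first."""
    newer = [obj for obj in objects if obj["LastModified"] > last_seen]
    return sorted(newer, key=lambda obj: obj["LastModified"])


def load_last_seen(state_file: Path) -> datetime:
    """Read the stored timestamp, or return the earliest possible time on the first run."""
    if state_file.exists():
        stored = json.loads(state_file.read_text())["last_seen"]
        return datetime.fromisoformat(stored)
    return datetime.min.replace(tzinfo=timezone.utc)


def save_last_seen(state_file: Path, moment: datetime) -> None:
    """Persist the newest timestamp processed so far (hypothetical state-file format)."""
    state_file.write_text(json.dumps({"last_seen": moment.isoformat()}))
```

On each run, the job would pass response["Contents"] and load_last_seen(...) to find_updates, download whatever is returned, and then call save_last_seen with the newest LastModified value it processed.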

Available Datasets

Trustfull provides multiple datasets to support fraud detection, identity verification, and risk assessment. Below is a list of currently available datasets:

  • Disposable Phones: Contains information on phone numbers associated with disposable, temporary, or virtual providers.

Troubleshooting & FAQ

AWS CLI Command Not Found:

Check whether AWS CLI is installed by running the command below. If it is not installed, follow the AWS CLI Installation Guide.

aws --version

Access Denied When Running AWS Commands:

Verify that your credentials are correctly configured and still valid:

aws configure list

Missing Files in the Bucket:

Check that the bucket name and customer ID are correct by listing your prefix with the command below. For further support, contact your account manager.

aws s3 ls s3://your-dataset-bucket/123456/
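When accessing the datasets with boto3, the same problems surface as botocore ClientError exceptions, which carry the S3 error code in error.response["Error"]["Code"]. The sketch below maps a few common codes to the checks described in this section. The helper name troubleshooting_hint and the wording of the hints are assumptions for illustration, and the list of codes is not exhaustive.

```python
# Common S3 error codes mapped to the troubleshooting steps in this section.
HINTS = {
    "InvalidAccessKeyId": "The access key is not recognized; re-run 'aws configure'.",
    "SignatureDoesNotMatch": "The secret key looks wrong; re-enter it with 'aws configure'.",
    "AccessDenied": "Check that the credentials are valid and that the customer ID prefix is yours.",
    "NoSuchBucket": "Check the bucket name.",
    "NoSuchKey": "Check the customer ID and file name.",
    "404": "Check the customer ID and file name.",  # raised by download_file for a missing key
}


def troubleshooting_hint(error) -> str:
    """Map a ClientError (or any exception exposing .response) to a short hint."""
    code = getattr(error, "response", {}).get("Error", {}).get("Code", "Unknown")
    return HINTS.get(code, f"Unexpected error code '{code}'; contact your account manager.")
```

A script would typically wrap its S3 calls in try/except botocore.exceptions.ClientError and print troubleshooting_hint(err) in the handler.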