
datavalid
This library allows you to declare validation tasks that check CSV files. This ensures data correctness for ETL pipelines that update frequently.
pip install datavalid
Create a datavalid.yml file in your data folder:
files:
  fuse/complaint.csv:
    schema:
      uid:
        description: >
          accused officer's unique identifier. This references the `uid` column in personnel.csv
      tracking_number:
        description: >
          complaint tracking number from the agency the data originate from
      complaint_uid:
        description: >
          complaint unique identifier
        unique: true
        no_na: true
    validation_tasks:
      - name: "`complaint_uid`, `allegation` and `uid` should be unique together"
        unique:
          - complaint_uid
          - uid
          - allegation
      - name: if `allegation_finding` is "sustained" then `disposition` should also be "sustained"
        empty:
          and:
            - column: allegation_finding
              op: equal
              value: sustained
            - column: disposition
              op: not_equal
              value: sustained
  fuse/event.csv:
    schema:
      event_uid:
        description: >
          unique identifier for each event
        unique: true
        no_na: true
      kind:
        options:
          - officer_level_1_cert
          - officer_pc_12_qualification
          - officer_rank
    validation_tasks:
      - name: no officer with more than 1 left date in a calendar month
        where:
          column: kind
          op: equal
          value: officer_left
        group_by: uid
        no_more_than_once_per_30_days:
          date_from:
            year_column: year
            month_column: month
            day_column: day
save_bad_rows_to: invalid_rows.csv
Then run the datavalid command in that folder:
python -m datavalid
You can also specify a data folder that isn't the current working directory:
python -m datavalid --dir my_data_folder
The config file is named datavalid.yml and must be placed in your root data folder, which is the folder that contains all of your data files. It holds a config object in YAML format.
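As a rough illustration of the overall shape, a minimal config might look like the sketch below. It is modeled only on the fields that appear in the example earlier in this document; the file name `my_data.csv` and the column name `id` are hypothetical placeholders, and the top-level placement of `save_bad_rows_to` is an assumption.

```yaml
# Hypothetical minimal datavalid.yml; file and column names are placeholders.
files:
  my_data.csv:              # assumed to be a path relative to the root data folder
    schema:
      id:
        unique: true        # no duplicate values in this column
        no_na: true         # no missing values in this column
    validation_tasks:
      - name: "`id` should be unique"
        unique:
          - id
save_bad_rows_to: invalid_rows.csv   # assumed top-level field, as in the example above
```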
float: true.
Common fields:
Checker fields (define exactly one of these fields):
There are three ways to define a condition. The first is to provide `column`, `op` and `value`:
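A condition of this form, mirroring the `where` clause from the example config, might be written as in the sketch below. Only the `equal` and `not_equal` operators appear in this document, so no other `op` values are assumed here.

```yaml
# Single-comparison condition: selects rows whose `kind` column equals "officer_left".
where:
  column: kind
  op: equal
  value: officer_left
```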
The second is to provide an `and` field:
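The `empty` checker in the example config uses this form. A sketch along the same lines, assuming each list item is itself a condition:

```yaml
# `and` condition: fulfilled only when every sub-condition is fulfilled.
# Here the task passes when no row matches (the matching set is empty).
empty:
  and:
    - column: allegation_finding
      op: equal
      value: sustained
    - column: disposition
      op: not_equal
      value: sustained
```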
The last is to provide an `or` field:
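No `or` example survives in this document, so the sketch below is an assumption: it supposes that `or` takes the same list of sub-conditions as `and`. The values are borrowed from the `kind` column in the example config.

```yaml
# Assumed shape of an `or` condition: fulfilled when any sub-condition is fulfilled.
where:
  or:
    - column: kind
      op: equal
      value: officer_left
    - column: kind
      op: equal
      value: officer_rank
```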
The `or` field works the same as `and`, except that the sub-conditions are or-ed together, which means the condition is fulfilled if any of the sub-conditions is fulfilled.
`date_from` combines multiple columns to create dates.