[Data Governance] Open Source Data Quality Software

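The table below lists available open source data quality software distributions, each covering some aspect of data quality assessment.

Inclusion criteria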

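  • Any open source distribution that is publicly accessible in one of the repositories. For brevity, when a repository contains many different tools, only one link is provided.
  • A library/framework does not have to focus solely on data quality, since that functionality is often bundled with data cleaning or exploratory data analysis.
  • Data quality assessment matters in widely different contexts/workflows (from validating Excel sheets to big data pipelines, offline/online, etc.), so the list is a diverse collection.
  • Star/issue/fork counts are given as a rough measure of maturity. Use at your own risk.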

Open Source Data Quality Software

| Name | Description | Language | Online Docs | URL | Stars | Issues | Forks |
|---|---|---|---|---|---|---|---|
| awslabs/deequ | Deequ is a library built on top of Apache Spark for defining "unit tests for data", which measure data quality in large datasets | Scala | - | github | 1328 | 90 | 256 |
| data-cleaning/validate | validate: Data cleaning for statistical purposes | R | docs | github | 236 | 21 | 18 |
| datacleaner/DataCleaner | DataCleaner Community Edition | Java | docs | github | 371 | 172 | 136 |
| daveoncode/pyvaru | pyvaru: Rule based data validation library for python | Python | docs | github | 14 | 1 | 3 |
| great-expectations/great_expectations | Great Expectations helps data teams eliminate pipeline debt, through data testing, documentation, and profiling | Python | docs | github | 3127 | 147 | 348 |
| OpenRefine/OpenRefine | OpenRefine is a tool for working with messy data | Java | docs | github | 7735 | 595 | 1376 |
| pandas-profiling/pandas-profiling | pandas-profiling generates profile reports from a pandas DataFrame | Python | docs | github | 6338 | 44 | 962 |
| pyeve/cerberus | cerberus is a lightweight, extensible data validation library for Python | Python | docs | github | 2246 | 33 | 202 |
| ResidentMario/missingno | missingno is a missing data visualization module for Python | Python | - | github | 2540 | 15 | 334 |
| WeBankFinTech/Qualitis | Qualitis is a data quality management platform that supports quality verification, notification, and management for various datasources | Java | docs | github | 208 | 16 | 107 |
| whylabs/whylogs-python | whylogs-python is a Python implementation of whylogs | Python | docs | github | 191 | 10 | 7 |
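
To make concrete what "rule based data validation" and "unit tests for data" mean in the descriptions above, here is a minimal sketch in plain Python. It does not use the API of any tool in the table. The rule set, the column names, and the helper functions `validate_rows` and `missing_counts` are hypothetical and exist only for illustration.

```python
# A minimal, library-free sketch of rule-based data validation.
# All names here (RULES, validate_rows, missing_counts) are hypothetical
# and do not correspond to the API of any tool listed in the table.

# Each rule maps a column name to a predicate a valid value must satisfy.
RULES = {
    "id": lambda v: isinstance(v, int) and v > 0,
    "email": lambda v: isinstance(v, str) and "@" in v,
    # age is optional, but when present it must be a plausible integer
    "age": lambda v: v is None or (isinstance(v, int) and 0 <= v <= 130),
}


def validate_rows(rows, rules):
    """Return (row_index, column, value) for every rule violation."""
    failures = []
    for i, row in enumerate(rows):
        for column, rule in rules.items():
            value = row.get(column)
            if not rule(value):
                failures.append((i, column, value))
    return failures


def missing_counts(rows, columns):
    """Count None (or absent) values per column as a crude completeness profile."""
    return {c: sum(1 for r in rows if r.get(c) is None) for c in columns}


rows = [
    {"id": 1, "email": "a@example.com", "age": 34},
    {"id": 2, "email": "not-an-email", "age": None},
    {"id": -3, "email": None, "age": 200},
]

failures = validate_rows(rows, RULES)
missing = missing_counts(rows, ["id", "email", "age"])

for row_index, column, value in failures:
    print(f"row {row_index}: column {column!r} has invalid value {value!r}")
print("missing values per column:", missing)
```

The tools in the table typically add schema definitions, reporting, and integration with data frames or pipelines on top of this basic idea, so consult each project's own documentation for its actual interface.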

 

Original: https://www.openriskmanual.org/wiki/Open_Source_Data_Quality_Software
