Why Data Quality
Data quality = fit for use
Use:
- Customers use data in operations, decision making… (Is the data qualified for this use?)
- Data is fit for use if it is free of defects and possesses the features needed to complete the operation, decision, or plan.
Fit for use:
- Free of defects
- Possesses the features needed to complete the operation.
- Exactly the right info, at the right place at the right time.
- 🚧 Data use needs domain and data understanding to be useful 🚧
What is data
Data
Consists of:
- Data Model
- Data Values
Data Model
Also known as the data schema. It specifies the possible columns (attributes) of each data record and the type of each attribute.
- Not only the quality of the data matters, but also the quality of the documentation (what do the attributes mean?).
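As a minimal sketch of how a data model and its documentation can be written down and checked, the example below describes a hypothetical customer record. The names `Attribute`, `CUSTOMER_MODEL`, and `check_record` are assumptions made for illustration, not part of any particular library.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Attribute:
    name: str         # column / attribute name
    dtype: type       # expected Python type of the value
    description: str  # documentation: what the attribute means


# Hypothetical data model (schema) for a customer record.
CUSTOMER_MODEL = [
    Attribute("customer_id", int, "Unique identifier assigned at signup"),
    Attribute("email", str, "Primary contact address"),
    Attribute("age", int, "Age in years at time of entry"),
]


def check_record(record: dict, model: list) -> list:
    """Compare the data values of one record against the data model."""
    problems = []
    for attr in model:
        if attr.name not in record:
            problems.append(f"missing attribute: {attr.name}")
        elif not isinstance(record[attr.name], attr.dtype):
            problems.append(
                f"wrong type for {attr.name}: expected {attr.dtype.__name__}, "
                f"got {type(record[attr.name]).__name__}"
            )
    known = {attr.name for attr in model}
    for key in record:
        if key not in known:
            problems.append(f"unknown attribute: {key}")
    return problems


good = {"customer_id": 1, "email": "a@example.com", "age": 30}
bad = {"customer_id": "1", "email": "a@example.com", "nickname": "Al"}

print(check_record(good, CUSTOMER_MODEL))
print(check_record(bad, CUSTOMER_MODEL))
```

A check like this only covers the model. Whether the values themselves are correct (for example, whether the age is the real age) still needs domain understanding.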
Properties of Data that complicate its quality
- data multiplies
- data are more complex than they appear
- data are subtle and nuanced. They have become the organization’s lingua franca
- data create more value when they are on the move
- data are organic
- data can be digitized
- data are the means by which organizations encode knowledge. They are meta-assets.
- Data are intangible
- Each organization’s data are uniquely its own.
Machine Learning
Data quality problems can affect both the training set and the test set.
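The sketch below illustrates two simple problems of this kind, chosen here as examples: rows with missing values, and test rows that also occur in the training set (duplicates leaking across the split). The function name `quality_report` and the toy rows are hypothetical.

```python
def quality_report(train, test):
    """Summarize two simple quality problems across a train/test split."""

    def rows_with_missing(rows):
        # A row counts as incomplete if any of its values is None.
        return sum(1 for row in rows if any(v is None for v in row))

    train_set = set(train)
    # Test rows that are exact duplicates of training rows.
    leaked = [row for row in test if row in train_set]
    return {
        "train_rows_with_missing": rows_with_missing(train),
        "test_rows_with_missing": rows_with_missing(test),
        "test_rows_also_in_train": len(leaked),
    }


# Toy rows: (feature_1, feature_2, label)
train = [(5.1, "a", 0), (4.9, None, 1), (6.2, "b", 1)]
test = [(5.1, "a", 0), (5.8, "c", None)]

report = quality_report(train, test)
print(report)
```

Problems in the training set distort what the model learns, while problems in the test set distort the measured performance, so both sets deserve the same checks.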
Business Perspective
How to get the data right in the organization.
Data quality aspects:
- organizational
- architectural
- computational
Organizational
🚨 Data Governance
The collection of processes, roles, policies, standards, and metrics that ensure the effective and efficient use of information in enabling an organization to achieve its goals.
- capture and understand (exploration and profiling)
- improve quality and security
- manage (data pipelines)
- control (review and monitor)
- document
- empower people (give people permission to ask for data quality)
Why do organizations not make much improvement on data quality?
- the business case for data quality is not very good
- social, political, and structural issues
- data have properties unlike other assets, which organizations have yet to take into account
- some managers do not believe in the benefits of good quality data
- data quality programs have not yet fully embraced the properties of data as a business asset
- Immature data markets
The management system for data has to be covered by all other management systems, which makes it more complex.
Approach: focus on preventing errors at their sources.
- focus on the most important needs of the most important customers (daily)
- apply relentless attention to process
- manage all critical sources of data
- measure quality at the source and in business terms (daily)
- employ control at all levels to halt simple errors and move forward (daily)
- develop a knack for continuous improvement (daily)
- set and achieve aggressive targets for improvement
- formalize management accountabilities for data (senior)
- a broad, senior group leads the effort (senior)
- recognize the hard issues are soft, and actively manage cultural change. (senior)
Cost of poor data quality
- higher operation costs
- lower customer satisfaction
- lower employee morale
- lost sales
- lower trust between departments
- poorer / delayed decisions
- higher risk when adopting new technology
- more difficult to manage overall risk
- more difficult to set and execute strategy
- fewer options to “put data to work”
- harder to align the org
- distracts management attention
- threat to competitive position
Data Quality Checklist
- people
  - how many are needed
  - which skills
  - how to manage and educate them
  - job descriptions
- organization
  - where should they work
  - to whom should they report
  - business model
- management
  - accountability for everyone who touches data
  - connect data creators and customers
  - create feedback loops
- data
  - how to share
- culture
Data Science Roles in Organizations
- Data Scientist: everyday data prep & analytics
- Data science specialist: complex data types, actions
- Data steward: governance (managing, securing data, supports users)
- Data curator: organizing, labelling, documenting
- Data controller: under the GDPR, the entity that determines the why and the how of processing personal data
- Chief Data Officer: member of the management team, responsible for enterprise-wide governance and the utilization of data as an asset
Data Warehouse Architecture
data quality dashboard
Show these measures on a dashboard for the data warehouse:
- null values
- repairs
- completeness
- freshness
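As a rough sketch of how these four measures could be computed for one warehouse table, the example below works on a list of row dictionaries. The definitions used here are assumptions: completeness is the share of non-null cells, repairs are counted through a hypothetical `repaired` flag, and freshness is the time since the newest hypothetical `loaded_at` timestamp.

```python
from datetime import datetime, timedelta


def dashboard_measures(rows, fields, now):
    """Compute simple data quality measures for one table."""
    total_cells = len(rows) * len(fields)
    # Null values: cells in the checked fields that are missing or None.
    nulls = sum(1 for row in rows for f in fields if row.get(f) is None)
    # Repairs: rows flagged as corrected by a cleaning step (assumed flag).
    repairs = sum(1 for row in rows if row.get("repaired"))
    # Completeness: share of non-null cells; undefined for an empty table.
    completeness = 1 - nulls / total_cells if total_cells else None
    # Freshness: age of the most recently loaded row.
    newest = max((row["loaded_at"] for row in rows), default=None)
    freshness = now - newest if newest is not None else None
    return {
        "null_values": nulls,
        "repairs": repairs,
        "completeness": completeness,
        "freshness": freshness,
    }


rows = [
    {"id": 1, "city": "Ghent", "loaded_at": datetime(2024, 1, 1, 8), "repaired": False},
    {"id": 2, "city": None, "loaded_at": datetime(2024, 1, 1, 9), "repaired": True},
]

measures = dashboard_measures(rows, ["id", "city"], now=datetime(2024, 1, 1, 12))
print(measures)
```

Passing `now` explicitly keeps the freshness value reproducible; a real dashboard would more likely compute these measures in the warehouse itself and track them over time.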
Conclusion
- data is always dirty
- data is always dirtier than you think.
- minute DQ problems can have severe impacts.
- data cleaning needs domain and data understanding
- data use needs domain and data understanding
- a data scientist should know and tell you about the deficiencies in the data and the results
- data can even have been manipulated on purpose.