Assignment 1

Users’ perception of a dataset helps to determine its quality and how well it reflects user needs. Before using a dataset, it is good preparation to see how well other users recommend it with regard to data quality. To characterize this, the user metric has been categorized into six criteria.
1. Downloads: Data scientists tend to prefer datasets with higher download counts, on the assumption that those datasets are more accurate. For example, the GeneCards web source has 147820 downloads, which indicates a higher level of trust in the quality of its datasets.
2. Feedback: Feedback from other users gives a general judgement of, and level of satisfaction with, the datasets. For example, GeneCards data sources have some …

The Completeness criterion is based on Wand and Wang (1996), whose work is unique in the quality literature for its theoretical approach to the definition of quality criteria. The scope of their study is limited to an objective view of quality, based on the reliability of the stored data with respect to the external world. Nevertheless, it serves as a basis for deriving the completeness criteria for machine learning in this thesis.
1. Completeness: A data source can represent the real world well only if its data is complete. For example, the tumor cell size attribute has no empty fields. Completeness can be divided into two data quality sub-criteria.
• Missing values: A common technique in the machine learning process is to replace missing values with the mean value of that attribute, or to remove the records with missing values, depending on the proportion of missing values to the total number of records. Neither approach is appropriate when a significant percentage of values is missing, because this could lead to biased results.
• NULL values: Redman (1997) describes a NULL value as not applicable and therefore none, applicable but unknown, or applicability unknown. Nulls in a dataset are potentially ambiguous unless their meaning is clearly defined.
2. Correctness: Describes how meaningful and unambiguous the given data is. Correctness is further classified into two data quality sub-criteria.
• Cardinality:
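The treatment of missing values described above can be sketched as follows. This is a minimal illustration, not the method used in the thesis: the function names, the use of None to mark a missing entry, the 20% cut-off (max_ratio) and the sample tumor sizes are all assumptions made for the example.

```python
from statistics import mean


def missing_ratio(values):
    """Return the fraction of entries that are missing (None)."""
    if not values:
        return 0.0
    return sum(v is None for v in values) / len(values)


def handle_missing(values, max_ratio=0.2):
    """Impute missing entries with the attribute mean when few are missing.

    max_ratio is a hypothetical cut-off chosen for illustration only.
    Returns a (values, action) pair, where action is one of
    "unchanged", "imputed" or "flagged".
    """
    ratio = missing_ratio(values)
    if ratio == 0.0:
        return list(values), "unchanged"
    if ratio > max_ratio:
        # Too many gaps: imputing or dropping records could bias results,
        # so the attribute is flagged instead of being silently repaired.
        return list(values), "flagged"
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values], "imputed"


# Made-up tumor cell sizes with one missing entry out of ten.
tumor_size = [2.0, None, 4.0, 3.0, 3.0, 4.0, 2.0, 3.0, 3.0, 3.0]
cleaned, action = handle_missing(tumor_size)
print(action, cleaned)
```

Flagging the attribute above the cut-off, rather than imputing or deleting, reflects the point that neither repair is safe when a large share of the values is missing. A suitable cut-off depends on the dataset and the learning task.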
