We receive people claim data (time sheet) data merged with employee /consultant records worjibg on different oroject on weekly basis.based on this data we generate weekly actuals hours and next week forecasts etc.
Is there any mathematical way to understand if the data we ate receiving is good quality data.
In recent past we found some issues like consultant data is missing country name or incorrect claims. Or sometimes it shows wrong rates based on country..
I am looking for some formulae or method to judge whether the data we are using is fit for decision making. And how good it is for decision making in terms of its accuracy.
Have you looked at developing a regression model.
The ultimate objective of a multiple regression analysis is to develop a model that will accurately predict a dependent variable (y) as a function of a set of independent variables (x1, x2, x3, x4…….xk).
This will show which independent variables have most impact on your output variable, so it will determine how important country name is to determine to determining your weekly forecast.
The only way to run quality checks for the kind of errors you are describing is to bean count the data. You might want to see if there are differences in the frequencies of kinds of mistakes (i.e. wrong country has 5 times the count of wrong claim, within wrong claims the most frequent incorrect claim is A vs. all others, etc.). If you find patterns of this type you could look into possible causes and perhaps reduce error rates by eliminating the causes. You could also focus on the subset of variables you are currently using for your analysis but you should also recognize that limiting the focus in this manner will probably cause trouble later when someone decides to run an analysis using the uninspected data fields.
Yes, use a sample but think of where your resources are so you can get a representative sample. Then figure out how to verify inputted data vs actual data and find your % correct.
This is why we do measurement system analysis and evaluation. Look into MSA/MSE tools and techniques to find ways to put a number on how much trust you can place on your data. A rule of thumb is that you want at least 95% confidence. If you aren’t close to that you need a project to improve the measurement system before using the data to make decisions.
can you please point me to simple easy to understand example.
I do not think this approach will be successful as we have huge data and error can occur in any column anywhere.
Hence the reason for suggesting taking a random sample and bean counting the errors. While true that an error can occur anywhere it is very unlikely that the distribution of errors is uniform. A bean count sample will give you a sense of error frequency as a function of entry type and that, in turn, will provide guidance with respect to such things as prioritization of error correction and prioritization of efforts with respect to data cleaning.
1 . Take a block of data – you decide how big but it should cover some meaningful period of time – again this will be a judgment call on your part.
2. Assign a number to each record
3. Take a random number generator and have it generate a random sequence of numbers equal to the total count of your sample data set.
4. Pull the records whose numbers match some percent of the total random numbers (i.e. if you have 10,000 records take the thousand records whose assigned number matches the first 1000 numbers in the random list).
5. Examine the specific data fields for each record and count the number of times an error occurs. You will have to decide what constitutes an error.
6. Summarize the frequencies of errors for each field and pareto chart the results. Chances are good you will find a few fields with much larger error rates than others.
6. Those fields then become the focus of your investigation with respect to why they occur and what could be done to eliminate/reduce the error rate.
@rajaniesh – @rbutler has provided you some great advice. Another thing you should be doing is finding ways to error-proof your data entry. For example, instead of free-form data entry can you convert to a pull-down set of answers? Countries for example can be standardized so you don’t get “USA”, “America”, and “US” as various answers. For number entry, you should be checking the values as they are entered (if time is being entered for a week, is the value in a reasonable range?). And lastly, what is the value of the data being requested? I’ve found that much is asked, and little is used. Most of the people realize that this is not used and put little effort into entering it correctly. Ask only for what is truly required, and use it. And let the people who are providing the data know you are using it, and for what it is being used. They will help to improve the data accuracy if they feel it is worth it to them.