How To Avoid Misuse Of Big Data And Prevent Data Chaos
As we know, every coin has two sides. While Big Data is an indisputably promising technology that is already changing the face of modern data analytics, it is also a sensitive subject, and its misuse may lead to the reverse result: data chaos instead of insight-based business intelligence (BI).
Data analytics tools process huge volumes of information, and private data is the foundation of most Big Data initiatives; the more unique your data, the higher the chance it will be compromised. However, it is vital to remember that privacy is not the only weak spot: there is a long list of less obvious problems, intricately connected with each other.
This is science, baby! (but, in fact, it is not)
Another problem is this: people tend to treat Big Data analysis as science. In fact, analytical algorithms are far closer to engineering than to science, and the two are not the same.
Compare physics and missiles. Physics is undoubtedly a science: every hypothesis is tested and proved both theoretically and practically, and only then are the conclusions presented to the scientific community, because that is how science works.
Furthermore, science is open: anyone can verify any law or theory. As soon as someone reveals a serious flaw in the calculations or proposes an exciting new theory, it immediately becomes the focus of broad discussion among established scientists.
Missiles, on the other hand, are engineering structures that merely apply specific knowledge from physics. And as you certainly know, a flaw in their design can cause serious problems, and such flaws happen regularly.
You can’t disagree with math, can you?
From the previous point follows a dangerous consequence: a false sense that conclusions reached by a computer must be correct. You can’t disagree with math calculations, can you?
Without knowing the underlying algorithm, it is impossible to question the correctness of the calculations. In theory, professional mathematicians could assess it, if they were given access. But do they get that access? Most often not.
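To see why “the math is correct” guarantees so little, here is a minimal sketch with entirely hypothetical features and weights: two internally consistent scoring models process the same person and reach opposite verdicts. The arithmetic in both is flawless; the disagreement lives in design choices nobody outside can inspect.

```python
# Hypothetical illustration: two internally consistent scoring models
# evaluate the same person on the same features and disagree.
# Every number here is made up for the sketch.

person = {"age": 24, "prior_incidents": 1, "years_at_address": 0.5}

def model_a(p):
    # Weights one team might reasonably choose
    return 0.02 * p["age"] - 0.3 * p["prior_incidents"] + 0.5 * p["years_at_address"]

def model_b(p):
    # Equally defensible weights another team might choose
    return 0.01 * p["age"] - 0.8 * p["prior_incidents"] + 0.9 * p["years_at_address"]

for name, model in (("A", model_a), ("B", model_b)):
    score = model(person)
    verdict = "low risk" if score > 0 else "high risk"
    print(f"model {name}: score = {score:+.2f} -> {verdict}")  # A: low, B: high
```

Both printouts are “correct math”; the question worth asking is not whether the formula was computed correctly, but why those weights were chosen and not others.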
Black box is too black
Even if you have the knowledge, the experience, and the time, and you are eager to spend them on checking how some algorithm works, you will not be able to. In most cases Big Data analysis technologies are a trade secret, and their source code is not available.
In her talk “Weapons of Math Destruction”, Cathy O’Neil, a mathematician and activist, told the audience how she tried to study a Big Data-based value-added model that is widely used in the US.
“A friend of mine, who runs a secondary school in New York, decided to study the algorithm. Hers is a specialized school with an enhanced curriculum in natural sciences and math, so she was sure she would cope with it. She sent an official request to the Department of Education, and guess what the answer was? ‘You won’t be able to make sense of it anyway, it’s math!’
“She insisted, and at last they sent her the algorithm: a leaflet, which she showed to me. The document was far too abstract to clarify anything. So I filed a request under the Freedom of Information Act, and was refused. Later I discovered that the research centre in Madison, Wisconsin that develops this analytical model had signed a contract under which it has no right to open the algorithm for public access and analysis.”
“Nobody in the New York Department of Education understands how this model works. Teachers do not understand how they are assessed or what they should do to improve their results. And nobody is able, or willing, to explain it to them.”
What is vital and what is not?
The algorithm is not transparent; what is more, it is not clear which data are processed and which are ignored entirely. This is confusing not only for us, but also for the operator who works directly with the program that draws the conclusions.
As a result, the same piece of data can influence the outcome twice: once when it is entered into the program, and again when the operator makes the final decision. Conversely, some information may end up influencing nothing at all, if the operator assumed it had already been used in the analysis when in fact the algorithm had ignored it.
For instance, imagine a police officer in an area with a high crime rate. The algorithm warns that the man in front of him is a robber with 55% probability. The man carries a suspicious suitcase. Has the program already taken this fact into account? In other words: does the suitcase make the man more suspicious, or is the suitcase the very reason for the 55%?
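A minimal sketch of why this matters, using a toy Bayesian risk score with made-up numbers (no real policing system implied): if the officer mentally re-applies evidence the model may already have used, the same fact counts twice and the perceived probability inflates dramatically.

```python
# Toy Bayesian risk score; all numbers are invented for illustration.

def bayes_update(prior: float, likelihood_ratio: float) -> float:
    """Apply Bayes' rule in odds form: posterior odds = prior odds * LR."""
    odds = prior / (1 - prior)
    posterior_odds = odds * likelihood_ratio
    return posterior_odds / (1 + posterior_odds)

base_rate = 0.05      # hypothetical prior: 5% of people stopped here are robbers
lr_suitcase = 23.0    # hypothetical likelihood ratio for a suspicious suitcase

# The model's "55%" may already include the suitcase:
model_score = bayes_update(base_rate, lr_suitcase)
print(f"model score: {model_score:.2f}")                # ~0.55

# If the officer, unsure, counts the suitcase again on top of that score,
# the same evidence is applied twice:
double_counted = bayes_update(model_score, lr_suitcase)
print(f"suitcase counted twice: {double_counted:.2f}")  # ~0.97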
You should also bear in mind that the input data may contain mistakes, or may simply lack information that is essential for making the right decision.
The glass half full or half empty
The conclusions reached by the program are not fully transparent either, and they can be interpreted incorrectly: different people will read the same numbers differently. Is a 30% probability high or low? It depends on many factors, some of which we cannot even identify.
What is even worse, such percentages are used in competitive settings. For example, even a low probability that a person may commit a crime will not send him to prison, but it can still ruin his career prospects in certain establishments.
Similar algorithms are used in US government services to predict which applicant might give away a secret. Many applicants compete for a post, and some of them are refused only because their probability is slightly higher than average.
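Here is a hypothetical sketch of how such a cutoff behaves, with invented names and scores: applicants whose “leak risk” sits barely above the group average are rejected, even though the gaps between scores are far smaller than any plausible model error.

```python
# Invented names and scores: a hard cutoff at the group average turns
# tiny, probably meaningless score differences into binding decisions.

applicants = {"Alice": 0.31, "Bob": 0.29, "Carol": 0.33, "Dan": 0.28}

average = sum(applicants.values()) / len(applicants)   # 0.3025

for name, risk in applicants.items():
    verdict = "rejected" if risk > average else "considered"
    print(f"{name}: leak risk {risk:.2f} -> {verdict}")
```

Alice loses the post to Bob over a 0.02 gap that nobody can interpret, let alone appeal.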
No prejudice?
All the facts above suggest that one of the most advertised advantages of Big Data, its impartiality, does not actually hold. A decision made by a human, based on calculations produced by an algorithm developed by another human, remains a human decision.
Prejudice may or may not have influenced that decision. The problem is that the algorithm is confidential, and without knowing the input data we cannot be sure the decision was unbiased. Moreover, we cannot change anything: the whole process is strictly determined by source code we are not allowed to see.
Welcome to the dark side, Anakin
One more drawback of predictive algorithms is the self-fulfilling prophecy. For example, the Chicago police use an algorithm to identify potentially dangerous teenagers.
Police officers decide to “supervise” such a teenager: they visit him at home and show many other signs of attention, being “extremely polite” as usual. The teenager realizes he is already being treated as a criminal, although he has not yet committed any crime, and begins to behave accordingly. As a result, he may actually join a gang.
Of course, the heavy-handed manner of the police is partly to blame. But let us not forget that it is the algorithm that provides the “scientific grounds” for such actions.
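A toy simulation of this feedback loop (entirely made up, no real scoring system implied): supervision surfaces more recorded incidents, recorded incidents raise the next score, and a higher score attracts more supervision.

```python
# Made-up feedback loop: being flagged generates the very record
# that keeps you flagged.

def run(score: float, rounds: int = 10) -> float:
    for _ in range(rounds):
        supervised = score > 0.5                 # flagged -> extra attention
        recorded = 0.6 if supervised else 0.1    # attention surfaces incidents
        score = 0.5 * score + 0.5 * recorded     # next score tracks the record
    return score

print(f"starts at 0.45: ends near {run(0.45):.2f}")  # settles around 0.10
print(f"starts at 0.55: ends near {run(0.55):.2f}")  # settles around 0.60
```

Two nearly identical starting points drift to opposite ends of the scale, and the final scores look like confirmation of the original prediction.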
Whitney Merrill, in her talk “Predicting Crime in a Big Data World” at the 32nd Chaos Communication Congress, made a similar point: “A police officer is walking down the street, and the algorithm predicts that with 70% probability he will meet a robber there. Will he arrest someone only because he was supposed to find a robber in this area?”
Don’t want to take part? No way
If a governmental or commercial organization deploys some analytical software and you do not like it, you cannot simply say: “I’m fed up, I’m out.” Nobody will ask whether you agree to take part in such an experiment. Moreover, you are very unlikely even to be told about it.
Don’t get me wrong: I am not saying that all these drawbacks should make you refuse to use high-tech analytical algorithms. Big Data technologies are still at the very beginning of their development; they are unlikely to disappear, and they are here to stay for a long time. Nonetheless, it is high time to address these problems, before it is far too late.
Today we need secure algorithms with transparent data-processing mechanisms. Independent scientists must be allowed to analyze source code, and governments should establish appropriate laws. People should also know which mathematical algorithms different institutions use in their everyday lives. And, of course, all parties involved in the process should learn from their own mistakes.