Why is everyone talking about big data? How much data counts as big? Where does it come from, why is it needed, and how can it make money?
What is Big Data
Big Data refers to huge datasets of diverse data. They are huge because their volume is more than a single ordinary computer can process, and diverse because the data comes in different formats, is unstructured, and contains errors. Big data accumulates quickly and is used for many different purposes.
Big Data is not an ordinary database, even if it is very large. Here are the differences:
|Not big data|Big Data|
|---|---|
|A database of records on thousands of a corporation’s employees. Information in such a database has predetermined characteristics and properties; it can be presented as a table, as in Excel.|An employee action log. For example, all the data that a call center with 500 employees generates during operation.|
|The name, age, and marital status of all of Facebook’s 2.5 billion users: this is just a very large database.|Clicks on links, messages sent and received, likes and reposts, mouse movements, and taps on smartphone screens by all Facebook users.|
|An archive of recordings from city video surveillance cameras.|Data from a traffic violation recording system, with information about the traffic situation and violators’ license plates; information about metro passengers obtained using a face recognition system, including which of them are on the wanted list.|
The amount of information in the world is increasing every second, and what was considered big data a decade ago now fits on the hard drive of a home computer.
60 years ago, a 5 megabyte hard drive was twice the size of a refrigerator and weighed about a ton. A modern hard drive in any computer holds up to one and a half terabytes (1 terabyte is equal to 1 million megabytes) and is smaller in size than a regular book.
In 2021, big data is measured in petabytes. One petabyte equals one million gigabytes. A three-hour 4K film “weighs” 60-90 gigabytes, and the entire YouTube is 5 petabytes, or 67,000 such films. 1 million petabytes is 1 zettabyte.
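The unit arithmetic above can be checked with a short sketch. It assumes decimal (SI) units, where each step up is a factor of 1,000, and takes 75 gigabytes as an assumed midpoint of the 60-90 gigabyte range quoted for a film. The constant and function names are illustrative only.

```python
# Decimal (SI) storage units expressed in bytes; binary units are not used here.
MEGABYTE = 10**6
GIGABYTE = 10**9
TERABYTE = 10**12
PETABYTE = 10**15
ZETTABYTE = 10**21


def films_that_fit(storage_bytes: int, film_bytes: int) -> int:
    """Return how many whole films of a given size fit into the storage."""
    return storage_bytes // film_bytes


# Assumed midpoint of the 60-90 GB range quoted for a three-hour 4K film.
film_size = 75 * GIGABYTE

print(TERABYTE // MEGABYTE)    # megabytes in one terabyte
print(PETABYTE // GIGABYTE)    # gigabytes in one petabyte
print(ZETTABYTE // PETABYTE)   # petabytes in one zettabyte
print(films_that_fit(5 * PETABYTE, film_size))  # whole films in 5 petabytes
```

With a 75-gigabyte film, the last figure comes out just under 67,000, which matches the rounded number in the text.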
How does Big Data technology work?
Big data sources fall into three types: social, machine, and transactional.
Everything a person does online is a source of social big data. Every second, users upload 1,000 photos to Instagram and send over 3 million emails. On average, each person generates 1.7 megabytes of data every second.
Other examples of social Big Data sources are statistics of countries and cities, data on the movement of people, registration of deaths and births, and medical records.
Big data is also generated by machines, sensors and the Internet of Things. Information is received from smartphones, smart speakers, light bulbs and smart home systems, video cameras on the streets, weather satellites.
Transactional data arises from purchases, money transfers, goods deliveries, and ATM transactions.
How is big data processed?
- Big Data arrays are so large that ordinary Excel cannot handle them, so special software is used to work with them.
- This software is called “horizontally scalable” because it distributes tasks across multiple computers that process information simultaneously. The more machines are involved, the higher the throughput.
- Such software is based on MapReduce, a parallel computing model. The model works like this:
  - first, the data is filtered according to the conditions set by the researcher, sorted, and distributed among individual computers (nodes);
  - then the nodes process their data blocks in parallel and pass the results on to the next iteration.
MapReduce is not a specific program, but rather an algorithm that can be used to solve most of the problems of big data processing.
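As a rough illustration of the model described above, here is a single-machine sketch in Python that counts words. It follows the common map, shuffle, and reduce formulation; the function names and sample documents are hypothetical, and a real framework would run the map and reduce steps on separate nodes rather than in one process.

```python
from collections import defaultdict
from functools import reduce


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle_phase(pairs):
    """Shuffle: group emitted values by key so each key can be reduced on its own."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Reduce: collapse each key's list of values into a single total."""
    return {key: reduce(lambda a, b: a + b, values) for key, values in grouped.items()}


def word_count(documents):
    # Each document stands in for a data block that a separate node would process.
    mapped = [pair for doc in documents for pair in map_phase(doc)]
    return reduce_phase(shuffle_phase(mapped))


documents = ["big data is big", "data is everywhere"]  # hypothetical sample input
counts = word_count(documents)
print(counts)
```

Because each key is reduced independently, the reduce step can be spread across as many machines as there are keys, which is what makes the model horizontally scalable.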
Examples of software that relies on MapReduce:
- Hadoop is an open source suite of software for file storage, scheduling, and data collaboration. The system is designed so that if one node fails, the load is immediately redistributed to others without interrupting the calculations.
- Apache Spark is a set of libraries that allow you to perform calculations in memory and repeatedly access the results of calculations. It is used to solve a wide range of tasks, from simple data processing and filtering to machine learning.
Big data scientists use both tools: Hadoop for building data infrastructure and Spark for processing streaming information in real time.
Where is Big Data Analytics Used?
Big data is needed in marketing, transportation, automotive, healthcare, science, agriculture and other spheres in which it is possible to collect and process the necessary data sets.
Businesses need big data to:
- Optimize processes: for example, large banks use big data to train chatbots, programs that replace a live employee on simple issues and hand the conversation over to a specialist when necessary.
- Make predictions: by analyzing big sales data, companies can predict customer behavior and consumer demand for products based on the time of year or the situation in the world.
- Build models: by analyzing profit and cost data, a company can build a model to predict revenue.
Big data analysis makes it possible not only to systematize information, but also to find non-obvious cause-and-effect relationships.
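The idea of building a model to predict revenue from cost data can be sketched with ordinary least squares on a single variable. This is a minimal illustration under the assumption that revenue depends roughly linearly on costs; the figures and names are hypothetical, and a real model would use many more factors and far more data.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical monthly figures, in thousands: costs and the revenue that followed.
costs = [10, 20, 30, 40, 50]
revenue = [25, 45, 65, 85, 105]

slope, intercept = fit_line(costs, revenue)
predicted = slope * 60 + intercept  # forecast revenue for a planned cost of 60
print(slope, intercept, predicted)
```

A fitted line like this describes a correlation in past data; whether it reflects a real cause-and-effect relationship still has to be checked separately.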
Sales of goods
Online marketplace Amazon has launched a product recommendation system powered by machine learning. It takes into account not only the user’s behavior and previous purchases, but also the season, upcoming holidays, and other factors. After the system went live, recommendations began to generate 35% of all sales on the service.
Lenta supermarkets use big data to analyze information about purchases and offer personalized discounts on goods. For example, according to the company, a system based on purchase data can recognize that a customer has changed their approach to eating and will begin to offer them suitable products.
The American chain Kroger uses big data to personalize the coupons that shoppers receive via email. After the coupons were tailored to specific buyers, the share of purchases made with them rose from 3.7% to 70%.
Large companies, including Russian ones, have begun using recruiting robots to weed out, at the initial stage of the search, candidates who are not interested in the vacancy or do not fit it. For example, the company Stafory developed the robot Vera, which sorts resumes, makes initial calls, and selects interested candidates. PepsiCo filled 10% of its vacancies using the robot alone.
Banks actively use big data, for example to help protect customers from fraud. These technologies detect anomalies in user behavior, such as atypical purchases or transfers. As early as 2017, Visa was using data analysis to prevent $2 billion of fraud annually.
In 2020, Toyota faced a problem: it needed to understand why so many accidents were caused by drivers who confused the gas and brake pedals. The company collected data from its internet-connected cars and used it to determine how people press the pedals.
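Detecting an atypical purchase can be illustrated with a very simple statistical check. The sketch below flags amounts that lie far from a customer’s average, measured in standard deviations (a z-score). It is only a toy baseline with hypothetical data and an assumed threshold, not the method any particular bank or payment system uses.

```python
from statistics import mean, stdev


def find_anomalies(amounts, threshold=2.0):
    """Return amounts whose z-score exceeds the threshold in absolute value."""
    mu = mean(amounts)
    sigma = stdev(amounts)
    if sigma == 0:
        return []  # all values are identical, so nothing stands out
    return [a for a in amounts if abs(a - mu) / sigma > threshold]


# Hypothetical card purchases for one customer; the last one is atypical.
purchases = [20, 25, 22, 19, 24, 21, 23, 20, 22, 500]
print(find_anomalies(purchases))
```

Production systems typically combine many more signals, such as location, merchant, and time of day, but the principle of comparing new behavior with an established pattern is the same.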
It turned out that the force and speed of pressure differ depending on whether the person wants to slow down or accelerate. Now the company is developing a system that will determine the manner of pressure on the pedals while driving and will slow down the car if the driver presses on the gas pedal, but does it as if he wants to brake.
American scientists have used big data to determine how depression spreads. Researcher Munmun De Choudhury and her colleagues fed geotagged messages from Twitter, Facebook, and Reddit into a predictive model. Messages were selected by words that could indicate depressive and low moods. The model’s calculations matched the official data.
Big data is essential for government agencies. It is used not only to compile statistics but also to conduct surveillance of citizens. Similar systems exist in many countries: one well-known example is PRISM, which the FBI and the CIA use to collect personal data from social networks and from Microsoft, Google, and Apple products. In Russia, information about users and phone calls is collected by the SORM system.
Social big data helps group users by interests and personalize ads for them. People are ranked by age, gender, interests, and place of residence. Those who live in the same region, visit the same places, watch videos and read articles on similar topics are likely to be interested in the same products.
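Grouping users by shared attributes can be sketched as follows. The records, field names, and the choice of region and interest as segmentation keys are all hypothetical; real advertising systems work with far richer behavioral data.

```python
from collections import defaultdict

# Hypothetical user records; the field names are illustrative only.
users = [
    {"id": 1, "region": "Moscow", "interest": "travel"},
    {"id": 2, "region": "Moscow", "interest": "travel"},
    {"id": 3, "region": "Kazan", "interest": "sports"},
    {"id": 4, "region": "Moscow", "interest": "sports"},
]


def segment_users(records, keys=("region", "interest")):
    """Group user ids into segments that share the same values for `keys`."""
    segments = defaultdict(list)
    for record in records:
        segment = tuple(record[k] for k in keys)
        segments[segment].append(record["id"])
    return dict(segments)


segments = segment_users(users)
print(segments)
```

Each resulting segment can then be shown the same advertisement, on the assumption that people with matching attributes are interested in similar products.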
At the same time, scandals over the use of big data in marketing occur regularly. For example, in 2018 the streaming platform Netflix was accused of racism because it showed users different movie and TV series posters depending on their gender and nationality.
With the help of big data analysis, the media measure the audience. In this case, Big Data may even affect the editorial policy. Thus, the Huffington Post uses a system that shows statistics of visits, comments and other user actions in real time, as well as prepares analytical reports.
The system at the Huffington Post evaluates how effectively headlines grab the reader’s attention and develops methods for delivering content to specific categories of users. For example, it turned out that parents are more likely to read articles from their smartphones and late at night on weekdays, after they put their children to bed, and on weekends they are usually busy – as a result, content for parents is published on the site at a convenient time for them.
Big data helps optimize transportation and make delivery faster and cheaper. At DHL, big data helped tackle the so-called last mile problem, where the need to drive through courtyards and find parking before handing over an order eats up 28% of shipping costs in total. The company began analyzing the “last miles” using GPS information and data on traffic conditions. As a result, it reduced fuel costs and delivery times.
Inside the company, big data helps to track the quality of employees’ work, adherence to deadlines, and the correctness of their actions. For analysis, machine data is used, for example, from parcel scanners in branches, and social data – reviews of branch visitors in the application, on websites and in social networks.
Until 2016, neural network technology did not run on mobile devices; it was even considered impossible. A breakthrough in this area (thanks in part to the Russian startup Prisma) now lets us apply a huge number of filters, styles, and effects to photos and videos.
Airbnb has used Big Data to change user behavior. At one point the company found that visitors from Asia were leaving its property rental site too quickly and not returning. It turned out that they went from the main page to “Places nearby” and browsed the photos without going on to book.
The company analyzed user behavior in detail and replaced links in the Nearby section with the most popular travel destinations in Asian countries. As a result, conversion to bookings from this part of the planet increased by 10%.
Who works with big data?
The three main professions in Big Data are Data Engineer, Data Scientist, and Data Analyst.
Data Scientists specialize in Big Data analysis. They look for patterns, build models and predict future events based on them.
For example, a data scientist can use statistics on ATM withdrawals to develop a mathematical model that predicts the demand for cash. Such a system tells cash-in-transit crews how much money to bring to a particular ATM and when.
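A very simple baseline for this kind of forecast is a moving average over recent withdrawals. The sketch below is a hedged illustration with hypothetical figures and an assumed three-day window; a real cash-demand model would also account for paydays, holidays, weekdays, and the ATM’s location.

```python
def moving_average_forecast(history, window=3):
    """Forecast the next value as the mean of the last `window` observations."""
    if len(history) < window:
        raise ValueError("not enough history for the chosen window")
    recent = history[-window:]
    return sum(recent) / window


# Hypothetical daily withdrawals from one ATM, in thousands of rubles.
withdrawals = [120, 135, 150, 140, 160, 155, 165]
forecast = moving_average_forecast(withdrawals)
print(forecast)
```

The forecast for the next day is the average of the last three days, which gives a rough estimate of how much cash to load.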
To master this profession, you need an understanding of the basics of calculus and knowledge of programming languages such as Python or R, as well as the ability to work with SQL databases.
The data analyst uses the same set of tools as the data scientist, but for different purposes. Their tasks are to perform descriptive analysis and to interpret and present data in an easy-to-read form. They process data and deliver results in the form of analytical reports, statistics, and forecasts.
Other specialists also work with Big Data, for whom this is not the main field of work:
- interface designers analyzing behavioral research data to create user interfaces;
- NLP engineers who develop programs for chatbots and call center automation by analyzing natural language;
- marketing analysts who study data sets to build marketing policies and personalize ads;
- engineers and programmers in data processing enterprises.
The data engineer deals with the technical side and is the first to work with the information: they organize its collection, storage, and initial processing.
Data engineers help researchers create software and algorithms to automate tasks. Without such tools, big data would be useless, since volumes of this size could not be processed. Knowledge of Python and SQL is important for this profession, as is the ability to work with frameworks such as Spark.
Alexander Kondrashkin on other professions in which Big Data may be needed: “Somewhere a product manager may go to a Hadoop cluster himself and calculate something simple, if he has such skills. There are probably plenty of backend developers and DevOps engineers out there setting up the storage and collection of user data.”
Demand for big data and specialists in it
The demand for big data is growing: according to studies from 2020, even under a pessimistic scenario the Big Data market in Russia will grow from 45 billion to 65 billion rubles by 2024, and under an optimistic one to 230 billion.
Companies are increasingly turning to big data analytics because those that don’t lose out on profit. The Bell cites the example of Caterpillar: in 2014, its distributors were losing $9 to $18 billion in profit annually simply because they had not implemented Big Data processing technologies. The company’s 3.5 million units of equipment are now fitted with sensors that collect information about the condition and wear of key parts, which helps to better manage maintenance costs.
Along with the popularity of big data, there is a growing demand for those who can effectively work with it. In mid-2020, the MADE Big Data Academy from Mail.ru Group and HeadHunter conducted a study and found out that data analysts are already among the most in-demand on the labor market in Russia. In four years, the number of vacancies in this area has increased almost 10 times.
More than a third of vacancies for data scientists (38%) are in IT companies, followed by the financial sector (29%) and the business services industry (9%). In machine learning, IT companies publish 55% of the vacancies on the market, 10% come from the financial sector, and 9% from the services sector.