2. Select the Right Data for Exploration

As a data analyst, you must decide what data to collect and use for every project. With a nearly endless amount of data out there, this can be quite a dilemma, but there’s good news. In this lecture, you’ll learn which factors to consider when collecting data. Usually, you’ll have a head start in figuring out the right data for the job because the data you need will be given to you, or your business task or problem will narrow your choices. The following are some data-collection considerations to keep in mind for your analysis:

How the data will be collected

Let’s start with a question like, what’s causing increased rush hour traffic in your city? 

First, you need to know how the data will be collected. You might use observations of traffic patterns to count the number of cars on city streets within a particular time frame. You notice that cars are getting backed up on a specific street. That brings us to data sources. In our traffic example, your observations would be first-party data. 

First-party data is data collected by an individual or group using their own resources; in other words, data that you collect yourself. Collecting first-party data is typically the preferred method because you know exactly where it originated. You might also have second-party data, which is data collected by a group directly from its audience and then sold. Decide whether you will collect the data using your own resources or receive it (and possibly purchase it) from another party.

Data sources

In our example, if you can’t collect your own data, you might buy it from an organization that’s led traffic pattern studies in your city. This data didn’t start with you, but it’s still reliable because it came from a source with experience in traffic analysis. The same can’t always be said about third-party data, which is data provided by outside sources that did not collect it directly. This data might have passed through several different sources before it reached you. It might not be as reliable, but that doesn’t mean it can’t be useful. You’ll want to make sure you check it for accuracy, bias, and credibility. Whatever data you use, it must be inspected for accuracy and trustworthiness. We’ll learn more about that process later. 

For now, remember that the data you choose should apply to your needs, and it must be approved for use. As a data analyst, it’s your job to decide what data to use, and that means choosing the data that can help you find answers and solve problems without getting distracted by other data. In our traffic example, financial data probably wouldn’t be that helpful, but existing data about high-volume traffic times would be. 

If you don’t collect the data using your own resources, you might get data from second-party or third-party providers. Second-party data is collected directly by another group and then sold. Third-party data is sold by a provider that didn’t collect it. Third-party data might come from many different sources.

Solving your business problem

Datasets can show a lot of interesting information, but be sure to choose data that can help answer your question and solve your problem. For example, if you are analyzing trends over time, make sure you use time series data (in other words, data that includes dates).

How much data to collect

If you are collecting your own data, make reasonable decisions about sample size. A random sample from existing data might be fine for some projects. Other projects might need more strategic data collection to focus on certain criteria. Each project has its own needs. 
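As a rough illustration of pulling a random sample from existing data, here is a minimal Python sketch. The record names, the sample size of 50, and the helper draw_random_sample are all made up for this example; the right sample size depends on your project.

```python
import random

# Hypothetical existing data: one record per observed vehicle (made-up IDs).
existing_records = [f"vehicle_{i}" for i in range(1, 1001)]


def draw_random_sample(records, sample_size, seed=None):
    """Return sample_size records chosen at random, without replacement."""
    rng = random.Random(seed)  # a fixed seed makes the draw repeatable
    return rng.sample(records, sample_size)


# Assumed sample size; chosen only for illustration.
sample = draw_random_sample(existing_records, sample_size=50, seed=42)
print(len(sample), "of", len(existing_records), "records sampled")
```

Because the draw is made without replacement, no record appears in the sample twice. Passing a seed is optional and only serves to make the same sample reproducible.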

In data analytics, a population refers to all possible data values in a certain dataset. If you’re analyzing data about car traffic in a city, your population would be all the cars in that area. However, collecting data from the entire population can be challenging. That’s why a sample can be useful. 

A sample is a part of a population that is representative of the population. You might collect a sample of data about one spot in the city and analyze the traffic there, or you might pull a random sample from all existing data on the population. How you choose your sample will depend on your project. As you collect data, you’ll also want to make sure you select the right data type. An appropriate data type for traffic data could be the dates of traffic records stored in a date format. The dates could help you figure out which days of the week are likely to have a high volume of traffic in the future. We’ll explore this topic in more detail soon. 
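To show why storing dates in a date format matters, here is a small Python sketch that counts traffic records per day of the week. The date values and the helper records_per_weekday are hypothetical, and the sketch assumes the dates arrive as ISO-formatted text (YYYY-MM-DD).

```python
from collections import Counter
from datetime import date

# Hypothetical traffic records, one date string per record (made-up values).
raw_dates = [
    "2024-03-04", "2024-03-04", "2024-03-05",
    "2024-03-08", "2024-03-11", "2024-03-15",
]

WEEKDAYS = ("Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday")


def records_per_weekday(date_strings):
    """Parse ISO date strings into date objects and count records per weekday."""
    parsed = [date.fromisoformat(s) for s in date_strings]
    # date.weekday() returns 0 for Monday through 6 for Sunday.
    return Counter(WEEKDAYS[d.weekday()] for d in parsed)


counts = records_per_weekday(raw_dates)
for day, n in counts.most_common():
    print(day, n)
```

Once the values are real dates instead of plain text, the day of the week comes for free. With plain text you would have to work it out by hand.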

Time frame

If you are collecting your own data, decide how long you will need to collect it, especially if you are tracking trends over a long period of time. If you need an immediate answer, as you might in our traffic example, you won’t have time to collect new data and will need to use existing historical data. But let’s say you needed to track traffic patterns over a long period of time. That might affect the other decisions you make during data collection.

Now you know more about the different data collection considerations you’ll use as a data analyst. Because of that, you’ll be able to find the right data when you start collecting it yourself. There’s still more to learn about data collection, so stay tuned.

Use the flowchart below if data collection relies heavily on how much time you have:

Dr Nabeela: