Working with Data

Goal of a Data Science Project

A data science project needs a flexible, modular, iterative and multiparadigm workflow.

A Multiparadigm Data Science Workflow

The Multiparadigm Data Science (MPDS) approach enhances the workflow with:



Options to work with different types of data



The ability to branch out at every stage of the workflow and experiment with cross-disciplinary techniques



Access to a rich variety of algorithms driven by questions—not restricted to traditional methods associated with certain kinds of data or even a specific field of study



Different ways to communicate the final results of the analysis, depending on the audience



Incorporation of data processing, analysis and visualization capabilities all in one start-to-finish workflow

How Does the Workflow Help?

Agile Data Science



Rapid prototyping with a REPL (read-eval-print loop)



Iterative process of tweaking and improving



Modular approach to experimenting with and assembling a broad, flexible computational toolkit

Reproducible Research

You can publish your analyses along with the data and code so that others can run similar analyses on different data, or different analyses on the same data, in order to:



Verify results (replication == stronger evidence)



Build on existing results



Combine results for better insight

The Wolfram Data Repository is built to be a global resource for public data and data-backed publication.

Reproducible research checklist



A plan for structured data analysis



A modular pipeline for the analysis:



Data wrangling



Data cleaning



Exploratory analysis



Data preprocessing



Modeling



Validation



Creating visualizations, reports, etc.



Automate (write code) wherever possible (avoid point and click)



Document your code



Use version control



Record and preserve:



Sources: raw data, goals, references



Process: explorations, final code, observations and comments (selections and rejections)



Output: clean data, graphics, reports



Prepare for obsolescence—things will change, sources will get removed



Prepare for portability

Question

The first stage of the workflow is where you frame questions. To get some useful conclusions from the data, you need to start out with the right questions.

What Can You Learn from the Data?

That is a pretty broad question. It makes sense to break it down into a few specific questions that can guide your analysis.

Topic-specific questions

Questions like:



How many…?



Who…?



Where…?



What happened, together with…?

Generic questions

In addition to the topic-specific questions, it is helpful to keep in mind the audience for the analysis, right from the start.



Who is this analysis for?



What is the action that will be driven by the insight from this analysis?



How will they access the results? At what frequency?

The questions can be fuzzy as you start out, and they can change later. In fact, more interesting questions may surface as you sift through the data. However, it is important to set up questions at the beginning with the audience in mind. Otherwise, with the sheer variety of things that you can try with the data, you might end up wasting a lot of time trying unnecessary things.

Data Wrangling



Process of importing raw data and converting it into a suitable format for downstream analysis



Sometimes requires “hacking skills” to organize and clean messy data into an informative, manageable dataset



Goal is to create code for semi-automated tools that would make the process easier the next time the workflow is used

Exploratory Data Analysis

Exploratory data analysis (EDA) can help:



Gain an intuitive understanding of the underlying nature of the dataset



Identify relationships between variables



Formulate good questions for the actual analysis (as the explorations proceed, those questions can change)



Evaluate the quality of the data (Data QA)

Tools used in EDA can be categorized as:



Graphical or non-graphical



Univariate (exploring one feature/variable at a time) or multivariate (exploring combined behavior of more than one variable)

Analyze

Machine Learning Can Help

Questions that can be answered by supervised machine learning:



Classification



Regression
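
The two supervised tasks above can be sketched with the built-in Classify and Predict functions. The tiny training sets below are made up purely for illustration and are not part of the original material.

```wolfram
(* Classification: hypothetical toy examples mapping a number to a label *)
classifier = Classify[{1 -> "low", 2 -> "low", 3 -> "low",
    10 -> "high", 11 -> "high", 12 -> "high"}];
classifier[2.5]   (* predicted label for a new value *)

(* Regression: hypothetical toy examples that roughly follow y = 2 x *)
predictor = Predict[{1 -> 2.1, 2 -> 3.9, 3 -> 6.2, 4 -> 8.1, 5 -> 9.8}];
predictor[3.5]    (* predicted numeric value for a new input *)
```

Both functions choose a method automatically, so real results depend on the amount and quality of the training data.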

Questions that can be answered by unsupervised machine learning:



How is the data organized?



Does the data have some inherent structure?



Do the samples sort themselves out into different groups and subgroups?



Is this unusual? Are there outliers in the data? (Anomaly detection)



What comes next? (Sequence prediction, time series forecasting)

Communicate

Multiple Channels of Communication



Visualizations and infographics



Computational essays or reports



Web-deployed apps or microsites

Working with Data from a File

Get the Data

Set the working directory:

Import the data file:

Check the number of rows:

Check the number of rows and the number of columns:

Take a peek at the first few rows:
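
A minimal sketch of these steps follows. The original data file is not available here, so the example first writes a synthetic four-column CSV file to the temporary directory; the file name sample-data.csv, the random values and the choice of directory are all assumptions.

```wolfram
(* Set the working directory. $TemporaryDirectory is used so the sketch runs
   anywhere; in a saved notebook, SetDirectory[NotebookDirectory[]] is common. *)
SetDirectory[$TemporaryDirectory];

(* Hypothetical stand-in for the real data file: 100 rows, 4 numeric columns *)
SeedRandom[1];
Export["sample-data.csv", RandomVariate[NormalDistribution[0, 2], {100, 4}]];

(* Import the data file as a list of rows *)
data = Import["sample-data.csv", "Data"];

Length[data]       (* number of rows *)
Dimensions[data]   (* {number of rows, number of columns} *)
Take[data, 5]      (* first five rows *)
```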

Look at Samples and Features

Rows in Tabular Data (Samples)

Extract the first row:

Extract the 45th row:

Extract the last row:
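
Row extraction might look like this. The variable data is a synthetic stand-in (an assumption) for the imported file.

```wolfram
(* Hypothetical stand-in for the imported data: 100 rows, 4 columns *)
SeedRandom[1];
data = RandomVariate[NormalDistribution[0, 2], {100, 4}];

First[data]    (* first row; equivalent to data[[1]] *)
data[[45]]     (* 45th row *)
Last[data]     (* last row; equivalent to data[[-1]] *)
```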

Columns in Tabular Data (Features)

Extract the first column:

Check that the number of entries in column 1 matches the number of rows in the entire dataset (it must):

Extract the last column:
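
Columns can be pulled out with Part and the All specification. This is a sketch on synthetic stand-in data, not the original file.

```wolfram
(* Hypothetical stand-in for the imported data: 100 rows, 4 columns *)
SeedRandom[1];
data = RandomVariate[NormalDistribution[0, 2], {100, 4}];

firstColumn = data[[All, 1]];            (* first column *)
Length[firstColumn] == Length[data]      (* one entry per row, so this is True *)
lastColumn = data[[All, -1]];            (* last column *)
```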

Pick a Specific Row and Column

You may want to look at the value of a specific feature for a particular sample.

Get the value from the first row, second column:

Get the value from the fifth row, second column:

Get the value from the fifth row, third column:

Get the values from the first five rows, last column:
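
Individual values are addressed as data[[row, column]]. The sketch below again assumes synthetic stand-in data.

```wolfram
(* Hypothetical stand-in for the imported data: 100 rows, 4 columns *)
SeedRandom[1];
data = RandomVariate[NormalDistribution[0, 2], {100, 4}];

data[[1, 2]]         (* first row, second column *)
data[[5, 2]]         (* fifth row, second column *)
data[[5, 3]]         (* fifth row, third column *)
data[[1 ;; 5, -1]]   (* first five rows, last column *)
```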

Compute Descriptive Statistics

Mean of the first column:

Mean of all the columns:

Median of the third column:

Median of all columns:

Correlation, if any, between column 1 and column 2:

Correlation, if any, between column 1 and column 3:

Correlation between all four columns:

Make the correlation matrix look better:

Round the numbers in column 2 to the nearest integers:

Find the most common number:

Count the number of samples with positive values and the number of samples with negative values in column 1:
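
One possible set of commands for these statistics is sketched below. The data is a synthetic stand-in, so the actual numbers will differ from those of the original file, and the choice of MatrixForm for display is an assumption.

```wolfram
(* Hypothetical stand-in for the imported data: 100 rows, 4 columns *)
SeedRandom[1];
data = RandomVariate[NormalDistribution[0, 2], {100, 4}];
col1 = data[[All, 1]];

Mean[col1]               (* mean of the first column *)
Mean[data]               (* applied to a matrix, Mean gives one value per column *)
Median[data[[All, 3]]]   (* median of the third column *)
Median[data]             (* median of every column *)

Correlation[col1, data[[All, 2]]]   (* column 1 versus column 2 *)
Correlation[col1, data[[All, 3]]]   (* column 1 versus column 3 *)
corr = Correlation[data];           (* 4 x 4 correlation matrix *)
MatrixForm[Round[corr, 0.01]]       (* rounded and laid out as a matrix *)

rounded = Round[data[[All, 2]]];    (* column 2 rounded to integers *)
Commonest[rounded]                  (* most frequent value(s) *)

(* number of positive and number of negative samples in column 1 *)
{Count[col1, _?Positive], Count[col1, _?Negative]}
```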

Visualize the Data

Line Graphs

Plot the values in column 1:
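
A line graph of one column can be drawn with ListLinePlot. This sketch assumes synthetic stand-in data.

```wolfram
(* Hypothetical stand-in for the imported data: 100 rows, 4 columns *)
SeedRandom[1];
data = RandomVariate[NormalDistribution[0, 2], {100, 4}];

(* values of column 1 plotted against their row index *)
linePlot = ListLinePlot[data[[All, 1]], AxesLabel -> {"row", "column 1"}]
```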

Scatter Plot

Create a scatter plot of the values in column 1 against the values in column 3:
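
A scatter plot needs a list of {x, y} pairs, which Part can extract directly. The data and the axis orientation below are assumptions.

```wolfram
(* Hypothetical stand-in for the imported data: 100 rows, 4 columns *)
SeedRandom[1];
data = RandomVariate[NormalDistribution[0, 2], {100, 4}];

(* pairs {column 1, column 3}: column 1 on the horizontal axis;
   use data[[All, {3, 1}]] for the other orientation *)
scatter = ListPlot[data[[All, {1, 3}]], AxesLabel -> {"column 1", "column 3"}]
```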

Bar Charts

Visualize the means from each column in a bar chart:
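
The column means can be passed straight to BarChart. The data and the bar labels here are illustrative assumptions.

```wolfram
(* Hypothetical stand-in for the imported data: 100 rows, 4 columns *)
SeedRandom[1];
data = RandomVariate[NormalDistribution[0, 2], {100, 4}];

(* one bar per column mean; the labels are made up for this sketch *)
bars = BarChart[Mean[data], ChartLabels -> {"col 1", "col 2", "col 3", "col 4"}]
```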

Histograms

Create a histogram of the values in column 1:

Create a histogram of the values in column 1 with bin size 1:
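
Both histograms can be sketched with Histogram, where a second argument of the form {width} fixes the bin width. Synthetic stand-in data is assumed.

```wolfram
(* Hypothetical stand-in for the imported data: 100 rows, 4 columns *)
SeedRandom[1];
data = RandomVariate[NormalDistribution[0, 2], {100, 4}];
col1 = data[[All, 1]];

histAuto = Histogram[col1]        (* automatic binning *)
histUnit = Histogram[col1, {1}]   (* bins of width 1 *)
```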

Pie Charts

Visualize the number of positive versus negative samples in column 1 in a pie chart:
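
The positive and negative counts can be fed to PieChart. The data and the labels in this sketch are assumptions.

```wolfram
(* Hypothetical stand-in for the imported data: 100 rows, 4 columns *)
SeedRandom[1];
data = RandomVariate[NormalDistribution[0, 2], {100, 4}];
col1 = data[[All, 1]];

(* counts of positive and negative samples in column 1 *)
signCounts = {Count[col1, _?Positive], Count[col1, _?Negative]};
pie = PieChart[signCounts, ChartLabels -> {"positive", "negative"}]
```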

Scrape Data off a Webpage

Import the entire webpage as a single piece of text:

Quickly visualize the page content, to see the words that appear most frequently on the page:

Identify the adjectives on the page and count how many times each appears:

Check the types of elements that can be imported from the webpage:

Import the contents of the page in a structured list:

Import only the images from the webpage:

Import only the hyperlinks on the webpage:

Import the webpage as an image:
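
The scraping steps above might be written as follows. The page used in the original is not given, so the URL is a placeholder assumption; the example needs network access, and the results depend on the page at the time of import. Limiting the adjective list to the ten most frequent is also a choice made for this sketch.

```wolfram
(* Hypothetical placeholder URL; substitute the page you want to analyze *)
url = "https://www.wolfram.com";

text = Import[url, "Plaintext"];   (* whole page as one string *)
WordCloud[text]                    (* most frequent words appear largest *)

(* adjectives found on the page with their frequencies, ten most common *)
adjectiveCounts = Counts[TextCases[text, "Adjective"]];
TakeLargest[adjectiveCounts, UpTo[10]]

Import[url, "Elements"]           (* element types available for this page *)
Import[url, "Data"]               (* page content as a structured list *)
images = Import[url, "Images"];   (* images only *)
links = Import[url, "Hyperlinks"] (* hyperlinks only *)

WebImage[url]                     (* rendered page as an image *)
```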

Access Built-in Data

Natural Language Input

The Wolfram Knowledgebase

Entities

The Wolfram Language makes it easier to create standardized, computable descriptions of real-world constructs and data. Free-form linguistics can be used to access the continuously updated Wolfram Knowledgebase (also used in Wolfram|Alpha).

An Entity is a canonical object representing some real-world data.

The currently available entity types can be listed with:

A list of sample entities can be obtained by:

An EntityClass represents a class of entities of a specific type. The type can be an implicitly defined entity or a representation of a table (or virtual table) in a database.

A class of entities representing all the countries in North America:
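
A hedged sketch of these three lookups follows. The "Country" type is chosen only as an example, the zero-argument form of EntityValue is used on the assumption that it lists the entity types, and all calls require access to the Wolfram Knowledgebase.

```wolfram
(* list the available entity types
   (assumption: EntityValue with no arguments returns them) *)
entityTypes = EntityValue[];

(* a few sample entities of one type; "Country" is an illustrative choice *)
samples = EntityValue["Country", "SampleEntities"]

(* the class of all countries in North America, and its members *)
northAmerica = EntityClass["Country", "NorthAmerica"];
EntityList[northAmerica]
```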

Properties of entities

The available properties of an Entity can be viewed with:

The value of a specific property is obtained by:

You can also look up property metadata:
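
Property lookups could look like this. The entity (Canada), the property ("Population") and the metadata item ("Qualifiers") are illustrative assumptions, and the values returned come from the live knowledgebase.

```wolfram
canada = Entity["Country", "Canada"];   (* illustrative entity *)

EntityProperties["Country"]             (* properties available for countries *)

population = EntityValue[canada, "Population"]   (* value of one property *)

(* property metadata: the qualifiers that the property accepts *)
EntityValue[EntityProperty["Country", "Population"], "Qualifiers"]
```

The shorter form canada["Population"] is equivalent to the EntityValue call.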

Built-in datasets

The guide on Knowledge Representation & Access provides more information on the Wolfram Language representations of entities and their properties.

The Wolfram Language also supports custom entity stores that allow the same computations as the built-in knowledgebase and can be associated with external relational databases.

Here is a list of datasets spanning a variety of topics:



Physics and Chemistry Data



Earth Sciences Data



Engineering Data, Transportation Data



Socioeconomic & Demographic Data



Life Sciences & Medicine Data

