WOLFRAM NOTEBOOK

Working with Data

Goal of a Data Science Project

A data science project needs a flexible, modular, iterative and multiparadigm workflow.

A Multiparadigm Data Science Workflow

The multiparadigm data science (MPDS) approach enhances the workflow with:
  • Options to work with different types of data
  • The ability to branch out at every stage of the workflow and experiment with cross-disciplinary techniques
  • Access to a rich variety of algorithms driven by questions—not restricted to traditional methods associated with certain kinds of data or even a specific field of study
  • Different ways to communicate the final results of the analysis, depending on the audience
  • Incorporation of data processing, analysis and visualization capabilities all in one start-to-finish workflow
How Does the Workflow Help?

    Agile Data Science

  • Rapid prototyping with a REPL (read-evaluate-print loop)
  • Iterative process of tweaking and improving
  • Modular approach to experimenting with and assembling a broad, flexible computational toolkit
    Reproducible Research

    You can publish your analyses along with the data and code so that others can run similar analyses on different data, or different analyses on the same data, in order to:
  • Verify results (replication == stronger evidence)
  • Build on existing results
  • Combine results for better insight
    The Wolfram Data Repository is built to be a global resource for public data and data-backed publication.

    Reproducible research checklist

  • A plan for structured data analysis
  • A modular pipeline for the analysis:
      • Data wrangling
      • Data cleaning
      • Exploratory analysis
      • Data preprocessing
      • Modeling
      • Validation
      • Creating visualizations, reports, etc.
  • Automate (write code) wherever possible (avoid point and click)
  • Document your code
  • Use version control
  • Record and preserve:
      • Sources: raw data, goals, references
      • Process: explorations, final code, observations and comments (selections and rejections)
      • Output: clean data, graphics, reports
  • Prepare for obsolescence—things will change, sources will get removed
  • Prepare for portability
    Question

    The first stage of the workflow is where you frame questions. To get some useful conclusions from the data, you need to start out with the right questions.

    What Can You Learn from the Data?

    That is a pretty broad question. It makes sense to break it down into a few specific questions that can guide your analysis.

    Topic-specific questions

    Questions like:
  • How many?
  • Who?
  • Where?
  • What happened, and together with what?
    Generic questions

    In addition to the topic-specific questions, it is helpful to keep in mind the audience for the analysis, right from the start.
  • Who is this analysis for?
  • What is the action that will be driven by the insight from this analysis?
  • How will they access the results? At what frequency?
    The questions can be fuzzy as you start out, and they can change later. In fact, more interesting questions may surface as you sift through the data. However, it is important to set up questions at the beginning with the audience in mind. Otherwise, with the sheer variety of things that you can try with the data, you might end up wasting a lot of time trying unnecessary things.

    Data Wrangling

  • Process of importing raw data and converting it into a suitable format for downstream analysis
  • Sometimes requires “hacking skills” to organize and clean messy data into an informative, manageable dataset
  • The goal is to write code for semi-automated tools that make the process easier the next time the workflow is used
    Exploratory Data Analysis

    Exploratory data analysis (EDA) can help you:
  • Gain an intuitive understanding of the underlying nature of the dataset
  • Identify relationships between variables
  • Formulate good questions for the actual analysis (as the explorations proceed, those questions can change)
  • Evaluate the quality of the data (Data QA)
    Tools used in EDA can be categorized as:
  • Graphical or non-graphical
  • Univariate (exploring one feature/variable at a time) or multivariate (exploring combined behavior of more than one variable)
    Analyze

    Machine Learning Can Help

    Questions that can be answered by supervised machine learning:
  • Classification
  • Regression
    Questions that can be answered by unsupervised machine learning:
  • How is the data organized?
  • Does the data have some inherent structure?
  • Do the samples sort themselves out into different groups and subgroups?
  • Is this unusual? Are there outliers in the data? (Anomaly detection)
  • What comes next? (Sequence prediction, time series forecasting)
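
As a rough illustration of the supervised and unsupervised questions listed above, the following sketch uses the built-in functions Classify, Predict and FindClusters on tiny made-up inputs. The numbers and the labels "low" and "high" are invented for this example and do not come from the course data.

```wolfram
(* Supervised, classification: learn class labels from labeled examples *)
classifier = Classify[{1.0 -> "low", 1.2 -> "low", 0.8 -> "low",
    5.1 -> "high", 4.9 -> "high", 5.3 -> "high"}];
classifier[1.1]   (* predicted label for a new value *)

(* Supervised, regression: learn a numeric output from examples *)
predictor = Predict[{1 -> 2.1, 2 -> 3.9, 3 -> 6.2, 4 -> 8.1}];
predictor[5]      (* predicted number for a new input *)

(* Unsupervised: let unlabeled samples sort themselves into groups *)
FindClusters[{1.0, 1.2, 0.8, 5.1, 4.9, 5.3}]
```

With so few examples the results are only indicative; the point is that the same few functions map directly onto the kinds of questions in the lists.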
    Communicate

    Multiple Channels of Communication

  • Visualizations and infographics
  • Computational essays or reports
  • Web-deployed apps or microsites
    Working with Data from a File

    Get the Data

    Set the working directory:
    Import the data file:
    Check the number of rows:
    Check the number of rows and the number of columns:
    Take a peek at the first few rows:
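
The code cells for these steps are not preserved here, so the following is a minimal sketch of what they could look like. The file name "sample.csv" and its contents (100 rows by 4 columns of random numbers) are assumptions made so that the example runs on its own; substitute your own directory and file.

```wolfram
(* Hypothetical setup: write a small synthetic CSV file so the example is runnable *)
SeedRandom[1];
dir = CreateDirectory[];   (* new directory under $TemporaryDirectory *)
Export[FileNameJoin[{dir, "sample.csv"}],
  RandomVariate[NormalDistribution[], {100, 4}]];

SetDirectory[dir];              (* set the working directory *)
data = Import["sample.csv"];    (* import the data file as a list of rows *)
Length[data]                    (* number of rows *)
Dimensions[data]                (* {number of rows, number of columns} *)
Take[data, 5]                   (* first five rows *)
```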

    Look at Samples and Features

    Rows in Tabular Data (Samples)

    Extract the first row:
    Extract the 45th row:
    Extract the last row:

    Columns in Tabular Data (Features)

    Extract the first column:
    Check if the number of rows in column 1 is the same as the number of rows in the entire dataset (it has to be):
    Extract the last column:
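
A possible way to write the row and column extractions above uses Part (double square brackets). The matrix `data` here is synthetic stand-in data, assumed only for illustration.

```wolfram
(* Hypothetical stand-in for the imported file: 100 samples, 4 features *)
SeedRandom[1];
data = RandomVariate[NormalDistribution[], {100, 4}];

data[[1]]     (* first row *)
data[[45]]    (* 45th row *)
data[[-1]]    (* last row; negative indices count from the end *)

col1 = data[[All, 1]];          (* first column *)
Length[col1] == Length[data]    (* True: one value per row *)
data[[All, -1]]                 (* last column *)
```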

    Pick a Specific Row and Column

    You may want to look at the value of a specific feature for a particular sample.
    Get the value from the first row, second column:
    Get the value from the fifth row, second column:
    Get the value from the fifth row, third column:
    Get the values from the first five rows, last column:
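
The individual lookups above can be sketched with two-index Part expressions of the form data[[row, column]]. Again, `data` is synthetic and only stands in for the imported file.

```wolfram
(* Hypothetical stand-in for the imported file: 100 samples, 4 features *)
SeedRandom[1];
data = RandomVariate[NormalDistribution[], {100, 4}];

data[[1, 2]]         (* first row, second column *)
data[[5, 2]]         (* fifth row, second column *)
data[[5, 3]]         (* fifth row, third column *)
data[[1 ;; 5, -1]]   (* first five rows, last column *)
```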

    Compute Descriptive Statistics

    Mean of the first column:
    Mean of all the columns:
    Median of the third column:
    Median of all columns:
    Correlation, if any, between column 1 and column 2:
    Correlation, if any, between column 1 and column 3:
    Correlation between all four columns:
    Make the correlation matrix look better:
    Round the numbers in column 2 to nearest integers:
    Find the most common number:
    Count the number of samples with positive values and the number of samples with negative values in column 1:
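
One way to compute the statistics listed above is sketched below on synthetic data (an assumption; the course file is not available here). Mean, Median and Correlation act column by column when given a matrix.

```wolfram
(* Hypothetical stand-in for the imported file: 100 samples, 4 features *)
SeedRandom[1];
data = RandomVariate[NormalDistribution[], {100, 4}];

Mean[data[[All, 1]]]      (* mean of the first column *)
Mean[data]                (* means of all columns *)
Median[data[[All, 3]]]    (* median of the third column *)
Median[data]              (* medians of all columns *)

Correlation[data[[All, 1]], data[[All, 2]]]   (* column 1 vs. column 2 *)
Correlation[data[[All, 1]], data[[All, 3]]]   (* column 1 vs. column 3 *)
Correlation[data] // MatrixForm               (* full correlation matrix, displayed as a grid *)

rounded = Round[data[[All, 2]]];   (* round column 2 to the nearest integers *)
Commonest[rounded]                 (* most common value(s) *)

Count[data[[All, 1]], _?Positive]  (* samples with positive values in column 1 *)
Count[data[[All, 1]], _?Negative]  (* samples with negative values in column 1 *)
```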

    Visualize the Data

    Line Graphs

    Plot the values in column 1:

    Scatter Plot

    Create a scatter plot of the values in column 1 against the values in column 3:

    Bar Charts

    Visualize the means from each column in a bar chart:
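
The line graph, scatter plot and bar chart described above might be produced as follows. The data is synthetic (an assumption), and the axis labels are illustrative choices.

```wolfram
(* Hypothetical stand-in for the imported file: 100 samples, 4 features *)
SeedRandom[1];
data = RandomVariate[NormalDistribution[], {100, 4}];

(* Line graph of the values in column 1, in row order *)
ListLinePlot[data[[All, 1]]]

(* Scatter plot: column 3 on the horizontal axis, column 1 on the vertical axis *)
ListPlot[data[[All, {3, 1}]], AxesLabel -> {"column 3", "column 1"}]

(* Bar chart of the column means *)
BarChart[Mean[data], ChartLabels -> {"1", "2", "3", "4"}]
```

Swapping the column order in `data[[All, {3, 1}]]` swaps the axes of the scatter plot.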

    Histograms

    Create a histogram of the values in column 1:
    Create a histogram of the values in column 1 with bin size 1:

    Pie Charts

    Visualize the number of positive versus negative samples in column 1 in a pie chart:
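
A sketch of the histogram and pie chart steps above, again on synthetic data that is assumed only for illustration. The second argument `{1}` to Histogram requests bins of width 1.

```wolfram
(* Hypothetical stand-in for the imported file: 100 samples, 4 features *)
SeedRandom[1];
data = RandomVariate[NormalDistribution[], {100, 4}];
col1 = data[[All, 1]];

Histogram[col1]        (* automatic binning *)
Histogram[col1, {1}]   (* bins of width 1 *)

(* Pie chart of positive versus negative samples in column 1 *)
PieChart[{Count[col1, _?Positive], Count[col1, _?Negative]},
  ChartLabels -> {"positive", "negative"}]
```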

    Scrape Data off a Webpage

    Import the entire webpage as a single piece of text:
    Quickly visualize the page content, to see the words that appear most frequently on the page:
    Identify the adjectives on the page and count how many times each appears:
    Check the types of elements that can be imported from the webpage:
    Import the contents of the page in a structured list:
    Import only the images from the webpage:
    Import only the hyperlinks on the webpage:
    Import the webpage as an image:
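
The steps above could be written roughly as follows. The URL is an arbitrary placeholder chosen for this sketch, not the page used in the original notebook, and the code needs network access to run.

```wolfram
url = "https://www.wolfram.com";   (* hypothetical example page *)

text = Import[url, "Plaintext"];   (* entire page as a single string *)
WordCloud[text]                    (* most frequent words appear largest *)

(* Adjectives on the page and how often each appears, most frequent first *)
ReverseSort[Counts[ToLowerCase[TextCases[text, "Adjective"]]]]

Import[url, "Elements"]     (* element types that can be imported from the page *)
Import[url, "Data"]         (* page contents as a structured (nested) list *)
Import[url, "Images"]       (* only the images *)
Import[url, "Hyperlinks"]   (* only the hyperlinks *)
WebImage[url]               (* rendered page as an image *)
```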

    Access Built-in Data

    Natural Language Input

    The Wolfram Knowledgebase

    Entities

    The Wolfram Language makes it easier to create standardized, computable descriptions of real-world constructs and data. Free-form linguistics can be used to access the continuously updated Wolfram Knowledgebase (also used in Wolfram|Alpha).
    An Entity is a canonical object representing some real-world data.
    The currently available entity types can be listed with:
    A list of sample entities can be obtained by:
    An EntityClass represents a class of entities of a specific type. The type can be an implicitly defined entity or a representation of a table (or virtual table) in a database.
    A class of entities representing all the countries in North America:
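
A minimal sketch of the entity queries described above. The choice of the "Country" type for the sample entities is an assumption made for illustration, and the calls need access to the Wolfram Knowledgebase.

```wolfram
EntityValue[]                             (* list of available entity types *)
EntityValue["Country", "SampleEntities"]  (* a few sample entities of type "Country" *)

(* Entity class for the countries in North America, and its members *)
northAmerica = EntityClass["Country", "NorthAmerica"];
EntityList[northAmerica]
```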

    Properties of entities

    The available properties of an Entity can be viewed with:
    The value of a specific property is obtained by:
    You can also look up property metadata:
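
The property lookups above might look like the following sketch. The entity (France) and the property ("Population") are arbitrary examples, and the "Date" annotation is used here as one assumed form of metadata lookup; the original notebook may have shown a different one.

```wolfram
france = Entity["Country", "France"];   (* hypothetical example entity *)

france["Properties"]    (* properties available for this entity *)
france["Population"]    (* value of a specific property *)

(* Metadata via the three-argument annotation form of EntityValue *)
EntityValue[france, "Population", "Date"]
```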

    Built-in datasets

    The guide on Knowledge Representation & Access provides more information on the Wolfram Language representations of entities and their properties.
    The Wolfram Language also supports custom entity stores that allow the same computations as the built-in knowledgebase and can be associated with external relational databases.
    Here is a list of datasets spanning a variety of topics:

    Find Curated Computable Data

    The Wolfram Data Repository

    Browse the Wolfram Data Repository website for computable datasets in various categories:

    Search for Datasets

    List resources matching a specific query:
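
For instance, a search could be written with ResourceSearch. The query string "meteorite" is an assumption chosen to match the dataset used in the next section, and the call needs network access.

```wolfram
(* Resources in the Wolfram resource system whose metadata matches the query *)
ResourceSearch["meteorite"]
```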

    Build on Example Code from the Resource Page

    Import the dataset “Meteorite Landings” (a collection of known meteorite landings) from the Wolfram Data Repository:
    Copy the code for the visualization example on the Meteorite Landings page and evaluate it:
    Change the country to Mongolia:
    Instead of listing all the different classes, count the number of samples in each class and show it in reverse-sorted order:
    Visually represent the same information:
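
The import, counting and visualization steps above can be sketched as follows. The example code from the resource page is not reproduced here, and the column name "Classification" is an assumption about the dataset's schema; the second line lists the actual column names so you can check it.

```wolfram
meteorites = ResourceData["Meteorite Landings"];   (* returns a Dataset; needs network access *)
Keys[First[Normal[meteorites]]]                    (* inspect the column names *)

(* Number of samples in each class, largest count first *)
counts = ReverseSort[Counts[Normal[meteorites[All, "Classification"]]]];

(* Bar chart of the ten most common classes, labeled by class name *)
BarChart[Take[counts, UpTo[10]], ChartLabels -> Automatic]
```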

    References

  • Collection of functions to work with the Wolfram Data Repository and workflows that involve the Wolfram Data Repository