Getting Molecular Properties through PUG-REST

Objectives

  • Learn the basic approach to getting data from PubChem through PUG-REST.
  • Retrieve a single property of a single compound.
  • Retrieve a single property of multiple compounds.
  • Retrieve multiple properties of multiple compounds.
  • Use Table or Map to make the same kind of requests.
  • Process a large amount of data by splitting it into smaller chunks.
  • Use Mathematica’s ServiceExecute function to query PubChem.
  • Use Mathematica’s Molecule function to directly create molecules and query properties.

    The Shortest Code to Get PubChem Data

    Let’s suppose that we want to get the molecular formula of water from PubChem through PUG-REST. You can get this data from your web browser via the following URL:
    ​https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/name/water/property/MolecularFormula/txt​
    ​
    Getting the same data programmatically is not difficult. It takes a single line of code: provide the URL as a string to the URLExecute function. Here we store the returned value in the variable res:
    In[]:=
    res=URLExecute["https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/name/water/property/MolecularFormula/txt"]
    Out[]=
    H2O

    LibreTexts Reading:

    Review Section 1.6.2 REST Architecture before doing this assignment, and reference back to the Compound Properties Table as needed.

    Exercises:

    Exercise 1a: Retrieve the molecular weight of ethanol in a “text” format.
    In[]:=
    (* write your code in this cell *)
    Exercise 1b: Retrieve the number of hydrogen-bond acceptors of aspirin in a “text” format.
    In[]:=
    (* write your code in this cell *)

    Formulating PUG-REST request URLs using variables

    In the previous examples, the PUG-REST request URLs were provided directly to URLExecute by typing the URL explicitly as the argument string. However, it is also possible to provide the URL using a variable. The following example shows how to formulate the PUG-REST request URL from variables, combining the pieces with the string concatenation operator ‘<>’; the result can then be passed to URLExecute:
    In[]:=
    pugrest="https://pubchem.ncbi.nlm.nih.gov/rest/pug";
    pugin="compound/name/water";
    pugoper="property/MolecularFormula";
    pugout="txt";

    url=pugrest<>"/"<>pugin<>"/"<>pugoper<>"/"<>pugout
    Out[]=
    https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/name/water/property/MolecularFormula/txt
    A PUG-REST request URL encodes three pieces of information (input, operation, output), preceded by the prologue common to all requests. In the above code cell, these pieces of information are stored in four different variables (pugrest, pugin, pugoper, pugout) and combined into a new variable url.
    ​
    The URLBuild function facilitates creating paths and queries for URLs:
    In[]:=
    url=URLBuild[{pugrest,pugin,pugoper,pugout}]
    Out[]=
    https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/name/water/property/MolecularFormula/txt
    Here, the strings stored in the four variables are enclosed in curly brackets, ‘{}’, which group them together as a List. The list is provided to URLBuild, which takes care of adding appropriate separator characters (in this case, ‘/’) and of converting special characters such as spaces and question marks to the standard URL encoding.
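    As a small sketch of the special-character handling, the hypothetical example below builds a request for a name that contains a space ("acetic acid" is an illustrative input, not one used elsewhere in this notebook). The space should appear percent-encoded in the resulting string:
    In[]:=
    (* the name segment contains a space, which URLBuild is expected to encode *)
    URLBuild[{"https://pubchem.ncbi.nlm.nih.gov/rest/pug","compound/name/acetic acid","property/MolecularFormula","txt"}]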
    ​
    Then, the URL can be passed to URLExecute:
    In[]:=
    URLExecute[url]
    Out[]=
    H2O
    Alternatively, a common Mathematica programming idiom—especially for functions that take a single argument—is to compose a series of functions using the prefix operator, ‘@’:
    In[]:=
    URLExecute@URLBuild[{pugrest,pugin,pugoper,pugout}]
    Out[]=
    H2O
    This is merely a more concise way of writing the nested functions:
    In[]:=
    URLExecute[URLBuild[{pugrest,pugin,pugoper,pugout}]]
    Out[]=
    H2O
    Warning: Avoid beginning variable names with capital letters in Mathematica. All of Mathematica’s built-in functions and constants begin with capital letters, and defining a variable with the same name as one of them will result in a naming conflict.

    Making multiple requests

    The approach in the previous section (using variables to construct a request URL) looks inconvenient compared to the single-line code shown at the beginning, where the request URL is provided directly to URLExecute. If you are making only one request, it is simpler to provide the URL directly to URLExecute than to assign the pieces to variables, construct the URL from them, and pass it to the function.
    ​
    However, if you are making a large number of requests, it would be very time consuming to type the respective request URLs for all requests. In that case, you want to store common parts as variables and use them in some shared process. For example, suppose that you want to retrieve the SMILES strings of 5 chemicals. It is more convenient to store this collection of entries together in a list:
    In[]:=
    names={"cytosine","benzene","motrin","aspirin","zolpidem"}
    Out[]=
    {cytosine,benzene,motrin,aspirin,zolpidem}
    Now the chemical names are stored in a list called names. We want to perform the same process on each of the chemical names in the list. To do this, we will encapsulate the process by defining a function that takes one of the names as an input and returns an output. The input is an argument of the function. It is very common to define functions that apply only to arguments of a certain type. (In Mathematica terminology, this is the Head of the expression.) Although restricting input arguments to specific types is not strictly necessary, it can help avoid problems that involve mixing wrong types. For example, when defining a function that takes a chemical name, the color Blue cannot be a valid molecule name; only a String is a valid type. We can define a function that restricts the input argument to the type String in the following way:
    In[]:=
    retrieveName[myName_String]:=With[
      {pugrest="https://pubchem.ncbi.nlm.nih.gov/rest/pug",
       pugoper="property/CanonicalSMILES",
       pugout="txt",
       pugin="compound/name/"<>myName},
      URLExecute@URLBuild[{pugrest,pugin,pugoper,pugout}]]
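    To see what the _String restriction does, here is a minimal sketch using a hypothetical helper, nameLength, that makes no network request. A String argument matches the definition, while any other argument leaves the expression unevaluated:
    In[]:=
    (* nameLength is an illustrative function, defined only for String arguments *)
    nameLength[myName_String]:=StringLength[myName]
    In[]:=
    nameLength["water"]
    Out[]=
    5
    In[]:=
    (* Blue is not a String, so no definition applies and the input comes back unevaluated *)
    nameLength[Blue]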
    Next, we can construct a Table of results by looping over each chemical name and providing it to the retrieveName function we have defined; this returns a List of results.
    In[]:=
    Table[
      retrieveName[myName],
      {myName,names}]
    Out[]=
    {C1=C(NC(=O)N=C1)N,C1=CC=CC=C1,CC(C)CC1=CC=C(C=C1)C(C)C(=O)O,CC(=O)OC1=CC=CC=C1C(=O)O,CC1=CC=C(C=C1)C2=C(N3C=C(C=CC3=N2)C)CC(=O)N(C)C}
    A common functional programming idiom is to Map a function over each element in a list. This is specified as:
    In[]:=
    Map[retrieveName,names]
    Out[]=
    {C1=C(NC(=O)N=C1)N,C1=CC=CC=C1,CC(C)CC1=CC=C(C=C1)C(C)C(=O)O,CC(=O)OC1=CC=CC=C1C(=O)O,CC1=CC=C(C=C1)C2=C(N3C=C(C=CC3=N2)C)CC(=O)N(C)C}
    Another way is to use the operator form of Map, which can then be applied to the list:
    In[]:=
    Map[retrieveName]@names
    Out[]=
    {C1=C(NC(=O)N=C1)N,C1=CC=CC=C1,CC(C)CC1=CC=C(C=C1)C(C)C(=O)O,CC(=O)OC1=CC=CC=C1C(=O)O,CC1=CC=C(C=C1)C2=C(N3C=C(C=CC3=N2)C)CC(=O)N(C)C}
    Map is often abbreviated as the ‘/@’ operator:
    In[]:=
    retrieveName/@names
    Out[]=
    {C1=C(NC(=O)N=C1)N,C1=CC=CC=C1,CC(C)CC1=CC=C(C=C1)C(C)C(=O)O,CC(=O)OC1=CC=CC=C1C(=O)O,CC1=CC=C(C=C1)C2=C(N3C=C(C=CC3=N2)C)CC(=O)N(C)C}
    Warning: When you make many programmatic access requests in a loop, limit your request rate to five requests per second or fewer. Please read the following document to learn more about PubChem’s usage policies:
    https://pubchemdocs.ncbi.nlm.nih.gov/programmatic-access$_RequestVolumeLimitations
    Violation of usage policies may result in the user being temporarily blocked from accessing PubChem (or NCBI) resources.
    Note that the request volume limit can be lowered through dynamic traffic control at times of excessive load (https://pubchemdocs.ncbi.nlm.nih.gov/dynamic-request-throttling). Throttling information is provided in the HTTP response headers, indicating the system-load state and the per-user limits. Based on this information, the user should moderate the speed at which requests are sent to PubChem. We will cover this topic in the next section.
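    As a rough sketch of where this throttling information appears, the response headers can be inspected with URLRead. The filter below simply keeps any header whose name contains "throttling"; the exact header name and value format are assumptions to verify against the PubChem documentation:
    In[]:=
    (* URLRead returns an HTTPResponse object, which exposes the headers as a list of rules *)
    response=URLRead["https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/name/water/property/MolecularFormula/txt"];
    Select[response["Headers"],StringContainsQ[First[#],"throttling",IgnoreCase->True]&]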

    Batching requests

    The example above has only five input chemical names to process, so it is not likely to violate the five-requests-per-second limit. However, if you have thousands of names to process, the above code will exceed the limit (considering that requests of this kind usually finish very quickly). Therefore, the request rate should be adjusted in some way. One approach is to insert a manual Pause between requests. To illustrate, we first define a longer list of names:
    In[]:=
    names={"water","benzene","methanol","ethene","ethanol","propene","1-propanol","2-propanol","butadiene","1-butanol","2-butanol","tert-butanol"};
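    A manual pause could be sketched as follows, assuming the retrieveName function defined earlier. The delay of 0.2 seconds is an illustrative choice that keeps the loop at or below five requests per second:
    In[]:=
    (* wait 0.2 s before each request, then look up the name *)
    Table[(Pause[0.2];retrieveName[myName]),{myName,names}]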
    The MapBatched function on the Wolfram Function Repository generalizes Map so that it conducts an action (default is to Pause for one second) every certain number of elements (default of 5); this is precisely the action we want to occur in our function. One can either refer to this function in a traditional form:
    In[]:=
    ResourceFunction["MapBatched"][retrieveName,names]
    Out[]=
    {O,C1=CC=CC=C1,CO,C=C,CCO,CC=C,CCCO,CC(C)O,C=CC=C,CCCCO,CCC(C)O,CC(C)(C)O}
    Or, alternatively, calling just the ResourceFunction returns a graphical representation:
    In[]:=
    ResourceFunction["MapBatched"]
    Out[]=
    [◼] MapBatched
    And this graphical representation can be used as a function, obtaining the same results as shown above:
    In[]:=
    [◼] MapBatched[retrieveName,names]
    Out[]=
    {O,C1=CC=CC=C1,CO,C=C,CCO,CC=C,CCCO,CC(C)O,C=CC=C,CCCCO,CCC(C)O,CC(C)(C)O}

    Exercises

    Exercise 3a: Retrieve the XLogP values of linear alkanes with 1 to 12 carbons.
  • Use the chemical names as inputs.
  • Define a function that retrieves the XLogP value for each molecule name from PubChem. Test this on a single molecule name.
  • Use the MapBatched function to apply your function to the entire list (to respect the limit of five requests per second).
    In[]:=
    (* write your code in this cell *)
    Exercise 3b: Retrieve the isomeric SMILES of the 20 common amino acids.
  • Use the chemical names as inputs. Because the 20 common amino acids in living organisms predominantly exist as one chiral form (the L-form), the names should be prefixed with “L-” (e.g., “L-alanine”, rather than “alanine”), except for “glycine” (which does not have a chiral center).
  • Define a function to retrieve the isomeric SMILES string for a given name and test your function using a single name. Warning: The non-letter characters in isomeric SMILES strings can be mistaken for other types of data, which can result in an error message. To handle this correctly, specify an explicit format interpretation using the URLExecute[url, params, format] style. In this case, use either URLExecute[url, {}, "Text"] or URLExecute[url, {}, "CSV"].
  • Use the MapBatched function to apply your function to the entire list.
    In[]:=
    (* write your code in this cell *)

    Getting multiple molecular properties

    All the examples we have seen in this notebook retrieved a single molecular property for a single compound (although we were able to get a desired property for a group of compounds by performing a Map over a list). However, it is possible to get multiple properties for multiple compounds with a single request.
    ​
    The following example retrieves the hydrogen-bond donor count, hydrogen-bond acceptor count, XLogP, and TPSA for 5 compounds (represented by PubChem Compound IDs (CIDs)) in a comma-separated values (CSV) format:
    In[]:=
    pugrest="https://pubchem.ncbi.nlm.nih.gov/rest/pug";
    pugin="compound/cid/4485,4499,5026,5734,8082";
    pugoper="property/HBondDonorCount,HBondAcceptorCount,XLogP,TPSA";
    pugout="csv";

    URLExecute@URLBuild[{pugrest,pugin,pugoper,pugout}]
    This can be displayed in a more readable form using TableForm:
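    A minimal sketch of this step, assuming the request is imported with the explicit "CSV" format so that it arrives as a list of rows with the column names in the first row:
    In[]:=
    (* request the data as CSV; the first row is expected to hold the column names *)
    data=URLExecute[URLBuild[{pugrest,pugin,pugoper,pugout}],{},"CSV"];
    TableForm[Rest[data],TableHeadings->{None,First[data]}]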

    PubChem Service for Mathematica

    The work above demonstrates general strategies for accessing any API. However, Mathematica provides additional support for interacting with many popular web service APIs, such as Twitter, Facebook, Dropbox, PubMed, PubChem, and ChemSpider. Behind the scenes, Mathematica is doing the same types of URL constructions described above, but it shields the user from having to worry about usage limits, authentication, and other details associated with using APIs.

    Basic usage

    Mathematica’s built-in PubChem service automatically handles property lookup. The “CompoundProperties” request returns the complete list of properties for a molecule, specified in this case by its “Name”:
    ServiceExecute returns its result as a Dataset consisting of a set of named properties. To extract just the desired property of “CanonicalSMILES”, one can query the returned dataset:
    A Dataset can be converted to an ordinary list by using the Normal function:
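    The three steps above might look like the following sketch. The request name and the "Name" parameter follow the description above, but the exact parameter syntax and the column names in the returned Dataset are assumptions to check against the PubChem service documentation:
    In[]:=
    (* look up all available properties of aspirin by name *)
    props=ServiceExecute["PubChem","CompoundProperties",{"Name"->"aspirin"}]
    In[]:=
    (* query one named column of the Dataset *)
    smiles=props[All,"CanonicalSMILES"]
    In[]:=
    (* convert the Dataset result to an ordinary list *)
    Normal[smiles]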

    Exercises (revisited using ServiceExecute)

    Applying the PubChem Service to retrieve a list of names

    The PubChem Service allows us to perform queries for batches of entries without worrying about the data access limit:
    Again, the results are returned in the form of an easily browsable table. To extract a particular column as a list, we specify that we want All rows and the column with a specific name. Once again, Normal is used to convert this into an ordinary list of values:

    Exercises—Revisited

  • Use the chemical names as inputs.
  • Use ServiceExecute to retrieve the XLogP value for each molecule name from PubChem.
  • Use ServiceExecute to retrieve the isomeric SMILES string for a given name and test your function using a single name from PubChem.

    Multiple names and multiple properties

    The PubChem service function can take a list of names as an argument, for example:
    We can extract a subset of the Dataset by providing a list of column names to extract:
    This Dataset contains named columns. The underlying data structure, exposed by its Normal form, is a list of Associations (known as “dictionaries” in other programming languages). Each Association has a set of keys and associated values:
    The Values of the Association are:
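    These operations can be tried without a network request on a small hand-built Dataset. The sketch below uses two names and SMILES strings that appear earlier in this notebook; the variable names ds and rows are illustrative:
    In[]:=
    (* a Dataset built from a list of Associations, mimicking the shape of the service result *)
    ds=Dataset[{<|"Name"->"water","CanonicalSMILES"->"O"|>,<|"Name"->"benzene","CanonicalSMILES"->"C1=CC=CC=C1"|>}];
    In[]:=
    (* extract a subset of columns, then expose the underlying list of Associations *)
    rows=Normal[ds[All,{"Name","CanonicalSMILES"}]];
    In[]:=
    Keys[First[rows]]
    Out[]=
    {Name,CanonicalSMILES}
    In[]:=
    Values[First[rows]]
    Out[]=
    {water,O}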

    Exercises:

    Molecule functionality

    Mathematica 12’s Molecule function can parse most common chemical names without making use of PubChem. A Molecule can be defined by providing a common name, SMILES string, or InChI string as the argument:
    Many molecular properties can then be retrieved using the MoleculeValue function:
    Alternatively, the same properties can be looked up using the operator form, the MoleculeProperty function:
    Mathematica tends to use the full name for properties rather than abbreviations; for example, the PubChem “TPSA” corresponds to:
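    A short sketch of these calls, assuming the aspirin SMILES string retrieved earlier and the property names "MolarMass" and "TopologicalPolarSurfaceArea":
    In[]:=
    (* build a Molecule from a SMILES string *)
    mol=Molecule["CC(=O)OC1=CC=CC=C1C(=O)O"];
    In[]:=
    MoleculeValue[mol,"MolarMass"]
    In[]:=
    (* operator form: create the property function first, then apply it to the molecule *)
    MoleculeProperty["MolarMass"][mol]
    In[]:=
    (* the full-name counterpart of the PubChem "TPSA" property *)
    MoleculeValue[mol,"TopologicalPolarSurfaceArea"]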
    Mathematica does not include all of the properties available on PubChem, such as the IsomericSMILES or XLogP values (although it does provide a different version of the logP calculation, “CrippenClogP”). There are many different logP calculations available, compared and contrasted at https://sussexdrugdiscovery.wordpress.com/2015/02/03/not-all-logps-are-calculated-equal-clogp-and-other-short-stories/; a more detailed statistical comparison can be found in Mannhold et al. (2008), https://doi.org/10.1002/jps.21494.
    An advantage of using Molecule is that it facilitates visualization of molecular structures in two dimensions:
    Molecules can also be visualized as interactive, rotatable 3-dimensional structures:
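    Both views can be sketched with the built-in plotting functions, here applied to a Molecule created from the benzene SMILES string used earlier:
    In[]:=
    benzene=Molecule["C1=CC=CC=C1"];
    (* two-dimensional structure diagram *)
    MoleculePlot[benzene]
    In[]:=
    (* interactive, rotatable three-dimensional structure *)
    MoleculePlot3D[benzene]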

    Exercises (revisited using Molecule)

  • Use Map, Molecule, and MoleculeProperty to retrieve the molecular weight of each amino acid in the list.

    Attributions

    Adapted from the corresponding OLCC 2019 Python Assignment:
    https://chem.libretexts.org/Courses/Intercollegiate_Courses/Cheminformatics_OLCC_(2019)/1._Introduction/1.9%3A_Python_Assignment_1