Administrative Tasks and Troubleshooting
Updating Wolfram Enterprise Private Cloud (EPC)
User Management
User management happens via the Java-based user API. As of Wolfram Cloud Version 1.50, this API can also be used from the command line. CRUD operations will be documented here when they become available.
◼
For details on the user API, contact Technical Support
Monitoring a Cluster
If you are running a cluster, a load balancer distributes user sessions among the compute nodes configured in the ClusterNodeInformation property. The cluster is ready to serve requests once at least one compute node is up and all other services are available. A user can determine which host their session is running on by evaluating $MachineName. An administrator can view all active sessions in the cluster by visiting the session monitor utility at http://{computenodehostname}:8080/utilities/session-monitor.jsp, where {computenodehostname} is the hostname of one of the compute nodes in your cluster.
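An administrator can also fetch that page from the command line; here is a minimal sketch, where node1.example.com is a hypothetical hostname standing in for one of your compute nodes:
# Fetch the session monitor page from one compute node (hypothetical hostname)
~$ curl -s http://node1.example.com:8080/utilities/session-monitor.jsp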
Monitoring: Logging
Logged data useful for monitoring the function of EPC is recorded in common system and application files, along with files unique to EPC. Relevant logs are stored in three different places:
◼
/var/log/ for standard system logs
◼
/www/tomcat/base/current/logs/ for instantiation events and uncaught application exceptions
◼
/www/tomcat/logs/wolfram/ for cloud platform logs
The standard system logs will be familiar to most Linux administrators, e.g. the file /var/log/dmesg contains information collected during the boot process of the virtual machine. Detailed coverage of these standard logs is outside the scope of this document, but the interested reader is encouraged to consult the Linux Information Project for more information. As EPC is an application run within the Apache Tomcat framework, its log files (along with standard log files generated by Tomcat) can be found in the directory /www/tomcat/. The following table summarizes the typical content of log files within /www/tomcat/:
Log File | Contents |
base/current/logs/catalina.out | Apache Tomcat startup and stdout/stderr messages |
base/current/logs/localhost.* | uncaught Java exceptions |
logs/wolfram/CloudExpressionStore.log | Redis connection events |
logs/wolfram/CloudPlatform.log | default cloud platform log file |
logs/wolfram/CloudStore.log | file and cloud object metadata retrieval messages |
logs/wolfram/EditEvents.log | notebook retrieval error messages |
logs/wolfram/EvaluationEvents.log | Wolfram Language evaluation events |
logs/wolfram/Graphics.log | error messages pertaining to graphics generation |
logs/wolfram/HttpRequestTimings.log | individual HTTP access request timings |
logs/wolfram/KernelLifeCycle.log | messages pertaining to the instantiation and destruction of Wolfram kernel processes |
logs/wolfram/MonitoringEvents.log | Wolfram Language evaluation events |
logs/wolfram/SystemStats.log | CPU and memory statistics |
logs/wolfram/EventStats.log | statistics on internal events |
logs/wolfram/KernelInit.log | timings and messages from kernel initialization |
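Most day-to-day monitoring can be done by following the main cloud platform log, or by scanning the logs above for errors; a minimal sketch using the paths from the table:
# Follow the main cloud platform log in real time
~$ tail -f /www/tomcat/logs/wolfram/CloudPlatform.log
# List which cloud platform logs contain ERROR entries
~$ grep -l ERROR /www/tomcat/logs/wolfram/*.log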
The EPC logging system uses log4j (log4j-core version 2.3) to record logging information using its standard hierarchy of logging levels:
Level | Description |
OFF | the highest possible rank; logging is essentially turned off |
FATAL | severe errors that cause premature termination |
ERROR | other runtime errors or unexpected conditions |
WARN | use of deprecated APIs, poor use of API, "almost" errors, other runtime situations that are undesirable or unexpected but not necessarily "wrong" |
INFO | interesting runtime events (startup/shutdown) |
DEBUG | detailed information on the flow through the system; generally speaking, most events should be logged at the DEBUG level |
TRACE | most detailed information |
By default, EPC logs all messages at the level of INFO or higher, which produces informative but verbose log files. Note that enabling DEBUG-level logging will produce extremely verbose logs and may negatively affect system performance. If this setting must be changed for debugging purposes, the change should be reverted as soon as possible; enabling debug logging on production systems is not recommended.
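To focus attention on the most serious entries, you can filter a log by level; a minimal sketch using the bracketed level field shown in the log examples below:
# Show only WARN, ERROR and FATAL entries from the default cloud platform log
~$ grep -E '\[(WARN|ERROR|FATAL)' /www/tomcat/logs/wolfram/CloudPlatform.log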
Log Examples
When the default INFO level of logging is configured, detailed information about requests that use EPC resources is recorded for use in monitoring, optimizing and debugging EPC usage. The most detailed record of activity in EPC can be found in /www/tomcat/logs/wolfram/CloudPlatform.log. Note that all lines begin with these standard pieces of information (when available): a timestamp associated with the event, the user and the caller’s IP address, the event’s request ID, the address of the host on which a kernel was acquired, the designated log level and the name of the handling thread, respectively:
[2018-11-02 08:03:25,352] [anonymous from XX.XX.XX.XXX] [gij78k_080325] [140.177.0.1] [INFO ] [http-nio-8080-exec-25]:
Here are several common Wolfram Cloud operations, along with a discussion of what log outputs one can expect when those operations have executed successfully. Please note that in most cases, line breaks and indentations have been added for clarity. All time indications are in milliseconds.
API calls have no log footprint at the moment of initialization. All log output is instead written when execution is finished or aborted. Basic information, in addition to a detailed collection of timings, shows up in /www/tomcat/logs/wolfram/CloudPlatform.log:
Note that the timing data given in the API controller timings entry are predominantly for internal operations outside user control. The "WL controller" step, however, measures the time spent evaluating user-supplied Wolfram Language code inside the kernel. A poorly performing API that shows large timings in this step may be addressed by optimizing its Wolfram Language implementation. Large timings in other sections, exceeding tens of milliseconds, may indicate underlying system problems, which might need to be addressed with the help of the Technical Support team.
Within the logs, there is no absolute indication that the given API call was successfully executed, and sometimes the previously shown message can appear even when the API evaluation fails halfway through. If that happens, timings of the later stages "WL controller" and "commit response" will be missing. If those are present, the API call most likely finished execution (although it still might have returned an error).
The request ID (in this case, i9hyla_034033) can be used to search for further associated logs, which can help you find out to which API evaluation this refers—and also whether anything went wrong. For example, in the case shown, we can see the actual submitted request in /www/tomcat/logs/wolfram/HttpRequestTimings.log:
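Since every line carries this request ID, a recursive grep gathers all related entries in one step; a minimal sketch:
# Collect every log entry associated with the request ID
~$ grep -r "i9hyla_034033" /www/tomcat/logs/wolfram/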
Upon completion, form evaluation generates log output similar to that of API calls. (It is, after all, an API server that serves the FormFunction.) When a form is submitted by a specific user (as opposed to anonymously, like the API call shown previously), the user’s cloud ID is included in the log:
In addition, one may find a corresponding message like this in CloudPlatform.log:
What is going on “under the hood” is made more explicit here (note the line “API server: Serving a FormFunction”). However, it is also plain to see that some useful information is missing; in particular, there is no request ID. To scan for further information, you can grep the logs for the UUID from the timings log message (in our example, 5f6402d5-e712-485b-b238-57fa1d43faf5).
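A minimal sketch of that scan, using the UUID above:
# Search the cloud platform logs for lines mentioning the object UUID
~$ grep -rn "5f6402d5-e712-485b-b238-57fa1d43faf5" /www/tomcat/logs/wolfram/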
Opening a document triggers a more comprehensive log entry. Note that all headers are provided:
Notebook evaluation triggers a log entry at the start and at the end of the process. Once again, the request ID (here, f0pdnq_013352) can be used to look for other related logs:
Scheduled task creation generates several log entries, in the following order: initialization, metadata summary (two entries), confirmation and evaluation timings:
The successful run of a previously defined scheduled task results in a series of rather self-explanatory log outputs:
If you need to investigate problems with a given scheduled task, be aware that sometimes log lines from various tasks are interleaved. The best way to gather related log events is to search for the Java thread associated with the process, in this case scheduler_Worker-1; you could also just look at the request ID field, 14u96y_011850.
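A minimal sketch of both searches:
# Isolate log lines produced by the scheduler worker thread
~$ grep "scheduler_Worker-1" /www/tomcat/logs/wolfram/CloudPlatform.log
# Or gather everything tied to the task's request ID
~$ grep -r "14u96y_011850" /www/tomcat/logs/wolfram/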
CloudDeploy events do not generate any distinct log entries.
Monitoring: Other Resources
If the system seems slow, the Java Virtual Machine’s (JVM) heap may be full, which in turn means that the JVM’s garbage collector is likely running overtime. Run tail -f GC.log to see if there is more activity than normal. “Normal” means a new GC run every few seconds; if the events are more closely spaced (e.g. close to one per second), then the garbage collector might be the cause of the responsiveness problems. In such a case, the JVM heap size should be increased.
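A rough, format-agnostic check is to watch how quickly GC.log grows; a minimal sketch (run from the directory containing GC.log, whose location may vary by installation):
# Print the GC log's line count every 10 seconds; rapid growth suggests heavy collector activity
~$ while sleep 10; do wc -l GC.log; done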
Troubleshooting
Subscription Troubleshooting
For subscription troubleshooting, contact Technical Support via email at privatecloud-support@wolfram.com or by calling 1-800-976-5309.
Zombie Kernels
The term “zombie process” refers to a process that has exited but whose exit status has not yet been collected (“reaped”) by its parent process via a wait() system call. Zombies are not really a problem, since they use virtually no resources.
Orphaned Processes
The term “orphaned process” describes a program, typically front end– or MathLink-related, whose direct parent dies while the program for some reason remains alive. Such a process gets assigned a new parent process, which is usually process 1 (i.e. the Linux initialization process). Orphaned processes can cause serious problems by refusing to quit on their own and holding on to large amounts of RAM and other resources.
It is unclear why orphaned processes occur, and indeed they are infrequent. However, if your EPC is not performing as expected, orphaned processes might be the culprit. They cannot be detected in the logs, since they are disconnected from Tomcat, but you can check for them in the process tree by typing the following at the command line:
~$ ps -Heo ppid,pid,args
Note: ps is the standard Linux “process status” command.
◼
-H shows the process hierarchy
◼
-e selects all processes
◼
-o specifies the format of the output; in this case, you are asking to view the process IDs (pid), grouped by the parent process ID (ppid), along with any program arguments (args)
This returns all current processes, sorted by their parent process ID. Here is what to look for:
◼
Processes associated with Mathematica, or occasionally MathLink (usually process names ending with .exe, e.g. PNG.exe, …)
◼
Processes whose parent process ID (ppid) is 1
All these processes are launched by the Tomcat Java process and should therefore have a ppid that is not 1 (there are, of course, plenty of legitimate processes with a ppid of 1, e.g. the Tomcat Java process itself).
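You can combine both criteria to flag likely orphans directly; a minimal sketch, using the .exe naming convention noted above:
# List processes re-parented to init (ppid 1) whose names look like Wolfram helper binaries
~$ ps -eo ppid,pid,args | awk '$1 == 1 && /\.exe/'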
Running Out of Resources
Request Timeouts
Just like on the public cloud at www.wolframcloud.com, requests to EPC are subject to a 30-second time limit. It is possible for an administrator to adjust this value, if needed, by editing the value of KernelPoolAcquireTimeLimit, which appears at the bottom of tomcat/webapps/app/webapp/WEB-INF/MSPConfiguration.xml. The value is given in milliseconds:
<KernelPoolAcquireTimeLimit>30000</KernelPoolAcquireTimeLimit>
This value reflects the maximum time a request may wait in the queue to acquire the needed kernel(s). The queue is not meant to be a proper queueing system—that is, one that accepts an arbitrary number of requests for the cloud to handle one by one. For one thing, requests are all held in RAM, so if the system or the Tomcat server is restarted, outstanding requests are dropped.
Another caveat when raising the request lifetime limit (i.e. how long a request stays in the queue) is that too high a setting might interact badly with other parts of the HTTP stack, depending on the user’s network, proxy and related configuration. For example, your EPC might live on a network where user proxies cut connections after a few minutes; in such a scenario, setting a very high request limit (e.g. an hour) will not help, because the infrastructure will cut the connection first.
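If you do decide to raise the limit, the change is a single value in MSPConfiguration.xml; a minimal sketch, assuming a 60-second limit is desired (a Tomcat restart is likely required for the new value to take effect):
# Raise the kernel acquisition timeout from 30000 ms to 60000 ms
~$ sed -i 's|<KernelPoolAcquireTimeLimit>30000<|<KernelPoolAcquireTimeLimit>60000<|' tomcat/webapps/app/webapp/WEB-INF/MSPConfiguration.xml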
Insufficient Kernels Allocated
If there are not enough kernels allocated to handle the expected number of concurrent requests, an error message will be produced. The exact message depends on a few factors.
Suppose your EPC has four deployment kernels, and you publish an API that runs for 60 seconds. You send four simultaneous requests to that API, each of which acquires a kernel and uses it for 60 seconds. Once these API calls are running, another call comes in; since all kernels are used up, the API call cannot acquire a kernel right away. The system will then put that kernel acquisition request into a queue to wait for a kernel to become available. If another call comes in, that call will also be appended to the same queue.
Any calls waiting in this queue have a 30-second timeout, i.e. if a call has been in the queue for 30 seconds, the system will drop it and fail that kernel acquisition. This means the call cannot get a kernel and will also fail.
In that case, an HTTP 503 error is returned to the user. Look for a message in KernelLifeCycle.log that reads “Kernel acquire time exceeded limit (make sure that unused kernels are released, or increase the number of available kernels).”
If even more requests come in on that busy EPC, another limiting factor comes into play: the queue that holds calls for future evaluation, mentioned previously, has a fixed size. By default, it can store 10 pending requests. If the queue is already full when a request arrives, that request fails right away, returning an HTTP 503 error, and a message appears in KernelLifeCycle.log that reads “Exceeded maximum number of queued requests, temporarily refusing new ones.” To increase the size of the queue, the configuration property APIRequestQueueSize can be set to a higher number.
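When diagnosing HTTP 503 responses, counting these two messages shows whether you are hitting the acquisition timeout, the queue limit or both; a minimal sketch:
# Count kernel acquisition timeouts
~$ grep -c "Kernel acquire time exceeded limit" /www/tomcat/logs/wolfram/KernelLifeCycle.log
# Count queue overflow rejections
~$ grep -c "Exceeded maximum number of queued requests" /www/tomcat/logs/wolfram/KernelLifeCycle.log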
Technical Support
This document currently provides limited resources for EPC troubleshooting, but it will be expanded over time. The majority of technical difficulties will require input from the Wolfram Technical Support team, which is available 24 hours a day, 7 days a week. To contact Technical Support, send as much detail about your problem as possible to privatecloud-support@wolfram.com, including your license number, account information, a description of the problem and the relevant log output, if appropriate. It is also possible, albeit less efficient, to contact private cloud support by phone at 1-800-976-5309.