@RELEASE: 6.4.0

    ==== CL 9973 ====
    @FIX: bug where UNC paths with backslashes won't work in the new worker_path_map

    @INTERNAL: Note: Backslashes are now NOT treated as special chars in QbConfigFile's tokenize() routine (called from parse())

    ==== CL 9966 ====
    @NEW: pyCmdline - a python-based implementation of cmdline jobtype

    ==== CL 9963 ====
    @FIX: add launchCondition so that worker and supervisor will not install if core is not present
    @NEW: write a registry key upon installation in order to provide dependency checking for core removal (core will not uninstall if worker or supervisor is installed)

    ==== CL 9959 ====
    @NEW: adding back-end run-time path conversion feature, and exposing in perl, python, and C++ APIs (qbconvertpath())

    ==== CL 9953 ====
    @FIX: fixed config file (qb.conf) parsing code so that it properly parses the worker_path_map

    Note: old code was corrupting qb.conf when upgrade_config tool was run.

    ==== CL 9937 ====
    @NEW: houdini loadOnce jobtype finds the appropriate houdini installation at runtime, based off HFS and optionally pkg['houdiniVersion'], user no longer has to guess at python path on the remote worker
    @NEW: add versionPicker controls to QubeGUI Houdini submission UI
    @NEW: new multi-line syntax for application paths in the job.conf file
    @NEW: added scanConfForPaths to backend utils module

    ==== CL 9930 ====
    @NEW: added qbworkerpathmap() to the C++ API and qb.workerpathmap() to the python API.

    The worker_path_map in qb.conf (or qbwrk.conf) must be defined like:

    worker_path_map = {
    [direct]
    H: = /home
    X: = /proj/x
    }

    Note, in particular, the "[direct]" keyword. That MUST be present.

    qb.workerpathmap() called in the python backend will return a nested dict of the format:

    {'directmap': {'X:': '/proj/x', 'H:': '/home'}}
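
    A minimal sketch of how a backend script might apply the 'directmap' part of such a dict to a single path. The helper name apply_direct_map and its matching rules (case-insensitive prefix match, backslashes turned into forward slashes) are assumptions made for illustration; they are not Qube's own conversion logic, which is exposed through qbconvertpath().

```python
def apply_direct_map(path, pathmap):
    """Rewrite a leading prefix of `path` using pathmap['directmap'].

    Hypothetical helper: the matching rules are illustrative assumptions.
    """
    direct = pathmap.get('directmap', {})
    # Try longer prefixes first so the most specific mapping wins.
    for src in sorted(direct, key=len, reverse=True):
        if path.lower().startswith(src.lower()):
            rest = path[len(src):].replace('\\', '/')
            return direct[src] + rest
    return path  # no mapping applies; leave the path untouched


# Same shape as the dict shown above.
pathmap = {'directmap': {'X:': '/proj/x', 'H:': '/home'}}
converted = apply_direct_map('X:\\shots\\sc01\\comp.nk', pathmap)
unchanged = apply_direct_map('/usr/local/bin', pathmap)
```

    In a real backend the dict would come from qb.workerpathmap() rather than being written out by hand.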

    @INTERNAL: fixed bugs in the config-file reader code, added a bunch of comments

    ==== CL 9918 ====
    @UPDATE: update Use.doc with table of all job flags and their descriptions, including info on the new migrate_on_frame_retry job flag

    ==== CL 9915 ====
    @NEW: added a new job flag "migrate_on_frame_retry", which, if set, forces a subjob to migrate to another worker if it fails a frame, and the frame is set to automatically retry (via retrywork).

    ==== CL 9909 ====
    @FIX: fixed issue that was causing jobs to NOT be considered for dispatch immediately at submission.

    Bug was introduced while attempting to fix a memory leak bug, in CL9592

    ==== CL 9903 ====
    @FIX: better message from worker when it rejects a dispatched subjob because it's a duplicate (being preempted or migrated on the same worker)

    ==== CL 9893 ====
    @NEW: add example qb.conf files for various-sized farms
    @NEW: add example qbwrk.conf to the build

    ==== CL 9891 ====
    @FIX: _highest_priority() routine to disregard priorities that are non-positive.

    ==== CL 9886 ====
    @UPDATE: admin doc with info on supervisor_highest_user_priority

    ==== CL 9882 ====
    @FIX: fixed pathmap bug where the object/data wasn't being properly transmitted over the network at all.

    @CHANGE: also uncommented the line that prints out the pathmap to the workerlog on worker boot.

    ==== CL 9865 ====
    @NEW: Added support for supervisor_highest_user_priority to the GUI's "Local Configuration" dialog.

    @NEW: Added supervisor_highest_user_priority to the qb.conf.template file.

    @CHANGE: Also modified the description of supervisor_max_priority in qb.conf.template to avoid confusion.

    ==== CL 9864 ====
    @NEW: added qb.conf setting "supervisor_highest_user_priority", which sets the highest priority (i.e., smallest numerical value) at which an ordinary (non-admin) user can submit/modify jobs.

    Users must be qube admin to be able to submit/modify at higher priority than this value.
    Its default value is 1.

    BUGZID: 63717
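
    The rule above can be sketched as a small predicate. This is only an illustration of the comparison (smaller number means higher priority); the function name and arguments are hypothetical, and these notes do not say whether the supervisor rejects or adjusts a request that exceeds the limit.

```python
def may_use_priority(priority, is_admin, highest_user_priority=1):
    """Return True if a user may submit/modify a job at `priority`.

    Smaller numbers mean higher priority, so a non-admin request is
    allowed only when it is numerically >= the configured limit.
    """
    if is_admin:
        return True
    return priority >= highest_user_priority


# With supervisor_highest_user_priority = 100 (an example value):
ordinary_user_ok = may_use_priority(50, is_admin=False, highest_user_priority=100)
admin_ok = may_use_priority(50, is_admin=True, highest_user_priority=100)
```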

    ==== CL 9838 ====
    @CHANGE: upped the default value for supervisor_max_threads to 100, and worker_max_threads to 32

    ==== CL 9837 ====
    @CHANGE: update the qb.conf templates, supervisor_max_threads=96, leave it uncommented until such time as this matches the supervisor's default behavior

    ==== CL 9788 ====
    @TWEAK: improved log message when worker goes into panic because of the lack of sufficient permissions

    ==== CL 9785 ====
    @FIX: worker issue where desktop worker would randomly crash.

    ZD: 6778

    ==== CL 9736 ====
    @NEW: add support for MySQL passwords to qb.query.mysqlConnect

    ==== CL 9711 ====
    @NEW: add Admin->Database Check/Repair functionality to the GUI
    @TWEAK: add ability to print to logPane in realtime for long-running processes, no need to wait until operation is finished
    @FIX: bugfix for Admin->Ping Supervisor raising KeyError when supervisor is down

    ==== CL 9698 ====
    @FIX: fixed false-negative warning message pertaining to "select() in checkpoint()" seen in supelog.

    Examples of these messages:

    select() in checkpoint(): Operation timed out
    select() in checkpoint(): Interrupted system call

    ==== CL 9694 ====
    @FIX: fixed issue with the supe threads getting tied up on "subjob X seems to be already assigned" message.

    On a farm with busy workers, the time between the supe dispatching a
    subjob to the worker via assignJob() and the worker reporting that the "subjob
    is running" can be several seconds to sometimes even several minutes, which
    was causing many supe threads to attempt dispatching the same subjob over
    and over. All of those threads end up hitting the "subjob X seems to be
    already assigned... retrying" message, and get tied up for 3 seconds while
    they retry.

    ZD: 6760 7125

    ==== CL 9689 ====
    @FIX: fixed bug in clustering algorithm where it incorrectly gave more
    weight to a job when the only difference was the last letter in the cluster

    For example, if:
    host cluster: /3D/projA
    job1 cluster: /3D/projB
    job2 cluster: /3D

    job1 was getting more weight than job2, which is incorrect.

    BUGZID: 63740
    ZD: 7043
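
    One way to picture this class of bug, as a sketch only: comparing cluster strings character by character rewards "/3D/projB" for sharing the letters "proj" with "/3D/projA", while comparing whole path levels does not. The actual weighting formula is not given in these notes; shared_levels is a hypothetical helper.

```python
import os


def shared_levels(host_cluster, job_cluster):
    """Count the leading cluster levels that two cluster strings share."""
    host_parts = [p for p in host_cluster.split('/') if p]
    job_parts = [p for p in job_cluster.split('/') if p]
    count = 0
    for h, j in zip(host_parts, job_parts):
        if h != j:
            break
        count += 1
    return count


host = '/3D/projA'
job1 = '/3D/projB'
job2 = '/3D'

# Character-wise comparison: job1 appears "closer" to the host than job2.
char_job1 = len(os.path.commonprefix([host, job1]))
char_job2 = len(os.path.commonprefix([host, job2]))

# Level-wise comparison: both jobs share exactly one level ("3D").
level_job1 = shared_levels(host, job1)
level_job2 = shared_levels(host, job2)
```

    Under the level-wise comparison job1 gains nothing from the partial match on the last level.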

    ==== CL 9687 ====
    @INTEG: rel-6.3 -> main CL 9686
    @FIX: using deprecated "waitfor" attribute with Python api causes qb.submit() to raise a KeyError
    @FIX: properly convert "waitfor" value (jobid integer) to proper "dependency" string of "link-done-job-<id>"

    ==== CL 9686 ====
    @FIX: using deprecated "waitfor" attribute with Python api causes qb.submit() to raise a KeyError
    @FIX: properly convert "waitfor" value (jobid integer) to proper "dependency" string of "link-done-job-<id>"
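
    The conversion described above, sketched on a plain dict. The helper name is hypothetical and the code does not call the Qube API; it only shows the "waitfor" integer becoming a "link-done-job-<id>" dependency string.

```python
def waitfor_to_dependency(job):
    """Return a copy of `job` with a deprecated 'waitfor' id rewritten
    as a 'dependency' string of the form 'link-done-job-<id>'."""
    converted = dict(job)
    if 'waitfor' in converted:
        jobid = int(converted.pop('waitfor'))
        # Assumption: any existing 'dependency' value is simply replaced.
        converted['dependency'] = 'link-done-job-%d' % jobid
    return converted


old_style = {'name': 'comp', 'waitfor': 1234}
new_style = waitfor_to_dependency(old_style)
```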

    ==== CL 9678 ====
    @NEW: provide a "Studio Overrides Prefs" in the QubeGUI to allow mandated studio-wide preferences; these override userPrefs, which already override the "Studio Defaults Prefs". Added support for --studioprefs cmdline option and QUBEGUI_STUDIOPREFS environment variable.

    ==== CL 9677 ====
    @INTEG: rel-6.3->main CL 9676
    @FIX: update documentation and GUI help text to show correct "||" syntax for job restrictions list.

    ==== CL 9676 ====
    @FIX: update documentation and GUI help text to show correct "||" syntax for job restrictions list.

    ==== CL 9664 ====
    @CHANGE: specify unix_socket when connecting to MySQL server on localhost on non-Windows platforms

    ==== CL 9663 ====
    @INTEG: rel-6.3 -> main CL 9662
    @FIX: supervisor install was failing postflight scripts on OSX Server, explicitly set the mysql socket to /tmp/mysql.sock in /etc/my.cnf and /etc/qb.conf to avoid conflicting with the factory-installed default of /var/lib/mysql/mysql.sock

    ==== CL 9662 ====
    @FIX: supervisor was failing postflight upgrade scripts on OSX Server, explicitly set the mysql socket to /tmp/mysql.sock in /etc/my.cnf and /etc/qb.conf to avoid conflicting with the factory-installed default of /var/lib/mysql/mysql.sock

    ==== CL 9615 ====

    @FIX: Added code to properly log frames (to supelog and job log) when they go back to "pending" after the processing subjob/worker is found dead.

    @FIX: Added code in the supervisor to retry a failed worker connection
    after a random 5-10 sec sleep/delay, to alleviate network hiccups during
    network commands (kill, preempt, etc. of running subjobs).

    ZD: 6760
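
    The retry-after-random-delay idea can be sketched as follows. This is not the supervisor's C++ code; call_with_retry, its parameters, and the use of ConnectionError are assumptions chosen for the example. The sleep function is injectable so the demo below records the delay instead of actually waiting.

```python
import random
import time


def call_with_retry(func, retries=1, min_delay=5.0, max_delay=10.0,
                    sleep=time.sleep):
    """Call func(); on ConnectionError wait a random delay, then retry.

    After `retries` failed retries the last error is re-raised.
    """
    attempt = 0
    while True:
        try:
            return func()
        except ConnectionError:
            if attempt >= retries:
                raise
            attempt += 1
            sleep(random.uniform(min_delay, max_delay))


# Demo: a call that fails once, then succeeds.
state = {'calls': 0}
delays = []


def flaky_command():
    state['calls'] += 1
    if state['calls'] == 1:
        raise ConnectionError('network hiccup')
    return 'ok'


result = call_with_retry(flaky_command, sleep=delays.append)
```

    Randomizing the delay keeps many threads that failed at the same moment from all retrying at the same moment.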

    ==== CL 9614 ====
    @INTERNAL: fixed a small cosmetic bug introduced in CL 9606

    ==== CL 9607 ====
    @INTERNAL: added converseWorkerWithRetries() and also fixed small bug in the retry loop of converseSubSupervisorWithRetries()

    ==== CL 9592 ====
    Fixed code that was causing memory leaks when supervisor threads handled
    job submissions.

    ==== CL 9585 ====
    @FIX: issue where some jobs get stuck in the "dying" state when attempted to be killed

    ZD: 6616

    ==== CL 9578 ====
    @NEW: add another python example script which shows a 'block until' type of callback; a job can be submitted to run at a certain time of day, if the TOD is in the past, it's assumed to be tomorrow
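
    The "assumed to be tomorrow" rule can be sketched with the standard datetime module. This is not the shipped example script; next_occurrence is a hypothetical helper that shows only the date arithmetic.

```python
from datetime import datetime, time, timedelta


def next_occurrence(tod, now):
    """Return the next datetime at which the clock reads `tod`.

    If `tod` has already passed today, the result falls on tomorrow.
    """
    candidate = datetime.combine(now.date(), tod)
    if candidate < now:
        candidate += timedelta(days=1)
    return candidate


now = datetime(2024, 1, 1, 15, 0)
later_today = next_occurrence(time(18, 0), now)  # 18:00 is still ahead
tomorrow = next_occurrence(time(9, 0), now)      # 09:00 has already passed
```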

    ==== CL 9570 ====
    @FIX: improvements to the handling of GET_LOCK (aka "reserveJob()") timeout situations.

    ZD: 6617

    ==== CL 9549 ====
    @FIX: qbwrk.conf files that had any commented lines before the first valid template would cause an exception to be raised, and QubeGUI->worker->RMB->Configure would fail silently

    ==== CL 9535 ====
    @NEW: add submit-agenda-timeout-job.py example python script, to demonstrate submission of a job with frame-level timeouts.

    ZD: 6099

    ==== CL 9530 ====
    @FIX: Submitting paths to shotgun no longer depends on the visibility of output paths to the supervisor.
    @FIX: Shotgun submission script fails gracefully & logs a reason as to why it can't generate a thumbnail when thumbnail creation fails.

    ==== CL 9523 ====
    @FIX: fixed issue where the supervisor fails to correctly track the host assignment for subjobs.

    Symptom for this included seeing in the supelog, messages like "statusJob(): aberrant report from worker...", then followed by "subjob[xxxx] is assinged to worker[] with mac address[00:00:00:00:00:00]".

    These subjobs would then be in the "running" state, but not assigned to a worker.

    ==== CL 9522 ====
    @FIX: removed code that, for jobs with host.processors set to > 1, skipped the supervisor's local resource-reservation check and delegated the decision to the workers, which resulted in more network traffic and latency.

    ZD: 6141

    ==== CL 9507 ====
    @FIX: added more robust code that talks to the SMTP server when sending out email,
    to support some email servers with non-standard response behavior.
    ZD: 6209

    ==== CL 9504 ====
    @FIX: catch case where sg_path_to_frames is part of the Shotgun versionName, but the job has no outputPaths for the first frame; fallback to naming the version "job id: 123 jobName: ..."

    ==== CL 9500 ====
    @FIX: Windows Vista/7/2008-R2 installer - don't error out when installing the worker or supervisor as an Admin-equivalent account during creation of scheduled tasks. Properly remove scheduled tasks during uninstall.

    ==== CL 9496 ====
    @FIX: catch case when inserting a new cluster into cluster_dim when more than 1 worker exists in the new cluster; occurs during run of regular_slotcount.sql, doesn't prevent new record from being added, just generates line noise and error emails from cron...

    ==== CL 9494 ====
    @CHANGE: make explanation of "+ | *" in job/host restrictions less ambiguous

    ==== CL 9484 ====
    @FIX: calculate cpu-seconds for agenda-based jobs by summing up work times, not subjobs. Better support for resetting of the start times for retried work.

    ==== CL 9467 ====
    @NEW: add a random offset to the startup so that all workers don't report at the same time if they've started up at the same time.
    @CHANGE: don't retrieve job name, it's extraneous and not reported; cuts down the query count by one.
    @CHANGE: set workname for subjob to job.subid, not subid; easier to detect case where an agenda-based job falsely reports not having an agenda, so subjob id won't conflict with a frame number

    ==== CL 9463 ====
    @FIX: don't report memory usage in the case where MySQL fails to return a valid agenda name, usually caused by timeouts or maxed out connections.

    ==== CL 9461 ====
    @CHANGE: removing from VS solution: qbdeletevariable qbgetvariable qbsetvariable qbworkervar

    ==== CL 9460 ====
    @CHANGE: removing legacy commands from sbin-- qbworkervar, qbdeletevariable, qbgetvariable, qbsetvariable

    ==== CL 9459 ====
    @NEW: added ip address column ("address") to the banned DB table
    @NEW: enabled "qbadmin w -unremove <worker>" to work with hostname and IP address, in addition to the mac address.

    BUGZID: 63703

    ==== CL 9458 ====
    @NEW: adding QbTableVersion30.cpp to upgrade_supervisor.vcproj

    New DB table schema definition file for rel-6.4

    See also the previous changelist, CL9451

    ==== CL 9456 ====
    @FIX: moved the location of QbTableVersion29.cpp (rel-6.3) inside the upgrade_supervisor.vcproj file from the incorrect "Resource Files" folder to the proper "Source Files" folder.

    It appeared as though the file was missing from the build.
    (probably only cosmetic, but it was also confusing).

    ==== CL 9455 ====
    Back out changelist 9453, 9454

    Changes were somehow not effectively made to the vcproj files, so trying again after backing off these CLs.

    ==== CL 9451 ====
    @NEW: adding "name" column to the "banned" table

    Note that this involves a DB table schema change. A new table definition, QbTableVersion 30, is added, and will be released with 6.4.0

    BUGZID: 63681
    ZD: 5271

    ==== CL 9449 ====
    @FIX: fixed issue with removal of workers using the mac address (i.e. "qbadmin -worker remove <macaddr>") not working properly.

    BUGZID: 63447

    ==== CL 9446 ====
    @FIX: added "pgrp" modifying support to the supervisor code and the qbmodify() C++ API, qb.modify() Python API, and qb::modify() Perl API routines, and added a "-mpgrp <int>" option to the qbmodify command-line tool.

    BUGZID: 63680

    ==== CL 9443 ====
    @FIX: Added missing "qb.hostorder(id=JOBID)" routine to the python API.

    ==== CL 9442 ====
    @FIX: modified to raise exception when parameter "fields" is not of type list.

    BUGZID: 63627
    ZD: 3998

    ==== CL 9440 ====
    @FIX: variables such as $qb::jobid not working in callbacks on Windows

    BUGZID: 63686
    ZD: 5240

    ==== CL 9438 ====
    @FIX: minor fix to a perl example, callback3.pl, so that the job cmdline works in Windows too.

    ==== CL 9427 ====
    @FIX: added code to make sure all end-of-line in email data are CRLF (not just LF) in accordance with RFC2822.

    This was causing notification emails to not work with some email servers, as they would not respond, and the communicating supe thread would just stall.

    ZD: 5752
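
    A minimal sketch of the line-ending normalization described above. The supervisor's own implementation is not shown in these notes; to_crlf is a hypothetical helper that only illustrates turning every line ending into CRLF.

```python
import re


def to_crlf(text):
    """Convert every line ending (CRLF, bare CR, or bare LF) to CRLF."""
    # '\r\n' is listed first so an existing CRLF is matched as one unit
    # and is not expanded twice.
    return re.sub(r'\r\n|\r|\n', '\r\n', text)


body = 'Subject: job done\nline one\r\nline two\n'
normalized = to_crlf(body)
```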

    ==== CL 9411 ====
    @FIX: added code to chmod and open up the file permission of .out and .err files in the job log folder.

    This was causing subjobs to fail on systems with "mounted" job log path, as the supervisor will initially create these files when a subjob that previously never started is retried (the supe writes "qube! - retry/requeue on blahblah...") under the "root" user's ownership with mode 644, and the workers who get the subjobs can't write to it.

    ZD: 5965

    ==== CL 9407 ====
    @CHANGE: set upper limit for mysql user filehandles to 70,000; 'open-files-limit' setting in my.cnf is only a suggestion, mysql can auto-determine to a larger number, but its internal max value is 65535. Setting ulimit upper bound larger than 64K should prevent mysql from ever running out of file handles.

    ==== CL 9402 ====
    @FIX: adding "qbhash" command to windows.

    ==== CL 9395 ====
    @FIX: fixed issue causing the supervisor to crash at initialization, right after "finding other supes..." was printed in the supelog.

    The fix was in one of the base communication library routines QbConnection::receiveUdp().

    Sometimes, unknown/malformed data would be received on the UDP socket, and was causing the code to attempt to access beyond the buffer array (index out-of-bounds error).

    ZD: 5638
    BUGZID: 63305

    ==== CL 9370 ====
    @FIX: recreate the pfx_dw stored procedures and functions on Windows, as the MSI installer wipes them out during an upgrade.

    ==== CL 9342 ====
    @FIX: fixed a supe thread crashing issue, when global_host or license_host resource tracking is used.

    ZD: 5749

    ==== CL 9334 ====
    @FIX: add error handler for MySQL error 1146 "Table 'x' doesn't exist" for work and cpu time calculations for job data collector script
    @NEW: increment datawarehouse version to 10 to allow for installing this patch into existing databases

    ==== CL 9318 ====
    @FIX: fixed crash bugs that were introduced when the "dying" state was implemented for 6.3.1.

    ZD: 5794

    ==== CL 9312 ====
    @FIX: add mail template for auto-wrangling emails to the installers

    ==== CL 9299 ====
    @FIX: add mail template for auto-wrangling emails to the installers

    ==== CL 9277 ====
    @NEW: increase file handle limit for mysql user on Linux installs to 64K

    ==== CL 9274 ====
    @FIX: create global resource tables in data warehouse DB if they don't exist; creation was failing to happen in new DB installations.

    ==== CL 9265 ====
    @FIX: fixed job-level history not being recorded into .hst file.

    (Bug was introduced in CL9145, 9146)

    ZD: 5609

    ==== CL 9261 ====
    @CHANGE: cut down on the cmdline & cmdrange jobtypes' stdout; don't print 'LOG: ...' lines, make regex summaries much clearer, change printing of regexes to stderr to make it clearer that they're not actual errors, but rather things being searched for in the stderr stream.

    ==== CL 9252 ====
    @FIX: properly find qb.conf on Windows versions Vista and later when unable to contact the supervisor directly.

    ==== CL 9245 ====
    @FIX: GUI changes to be able to handle when supervisor host goes down, and both supervisor and MySQL server are unavailable. Also fix jobList not refreshing on down supervisor.

    ==== CL 9241 ====
    @FIX: fix GUI crashbug in MySQLConnect when supervisor does not answer a qb.ping

    ==== CL 9239 ====
    @FIX: global resource tables were not getting created in new instances of the datawarehouse db, only on upgrades.

    ==== CL 9232 ====
    @FIX: fixed example python code (jobSubmit06.py) to work on Windows too.

    ==== CL 9211 ====
    @FIX: added code to prevent the QbQueue::getSubjobReadyfindReady() routine from returning the same subjob to be dispatched over and over.

    This was causing the findSubjobAndReserveJob() and startJob() routines to
    hit the "subjob [N] seems to be already assigned" situation, and cause
    threads to enter a long, sometimes semi-infinite, sleep-and-retry loop.

    Fixed by adding code in the startJob() routine to quickly update the subjob
    status when the assignJob() returns QB_ASSIGN_OK (i.e., worker says it
    has accepted the subjob), instead of waiting until the worker later reports
    that the subjob is "running" via the STATUS_JOB message, which can take
    more than several seconds on a busy farm.

    Also reduced the number of maximum retries to 3 (MAX_ATTEMPTS), in the
    situations where a subjob "seems to be already assigned" or when a worker
    host says it's busy (QB_ASSIGN_BUSY). This prevents the threads from getting
    stuck for 10 or more seconds in a sleep-retry loop, and allows them to give
    up quickly and move on.

    ZD: 5449

    ==== CL 9198 ====
    @FIX: fixed issue with non-node-locked licenses ("FF:FF:...") not working (since 6.3.0)

    ==== CL 9174 ====
    @INTEG: rel-6.3->main CL 9173
    @FIX: ensure that mail sent by "qbadmin --emailtest" is RFC2822-compliant (no bare LF's, only CRLF)

    ==== CL 9173 ====
    @FIX: ensure that mail sent by qbadmin --emailtest is RFC2822-compliant (no bare LF's, only CRLF)

    ==== CL 9161 ====
    @NEW: add support for new 'dying' state into the GUI

    ==== CL 9150 ====
    @INTERNAL: QbDebug::filename(QbString) took if statement out, so resetting _filename is allowed

    ==== CL 9145 ====
    @FIX: disabled logging to /var/spool/qube/{host,user}, as it was creating large log files and causing sluggish performance.

    An option to enable these logs may be made available in the future.

    ==== CL 9142 ====
    @FIX: fixed issue where global resources tracking drifts and more subjobs than can be accommodated by the actual global resource count are dispatched.

    ZD: 5074

    ==== CL 9133 ====
    @INTERNAL: CentOS support for "buildpyc" in rpm/quberpm.pm

    ==== CL 9105 ====
    @NEW: A new transitional "dying" state for jobs that have been ordered to be "killed", but are still being processed by the system

    ==== CL 9084 ====
    @CHANGE: increase MySQL max_allowed_packet value from default of 1MB to 64MB to decrease frequency of "MySQL server has gone away (2006)" error messages.

    ==== CL 9083 ====
    @CHANGE: increase MySQL wait_timeout value from default of 8 hours to 36 hours to decrease frequency of "MySQL server has gone away (2006)" error messages.

    ==== CL 9066 ====
    @FIX: fixed "cpus" (subjob) count inaccuracy when a job's "cpus" was modified down and then up.

    For example, if a job with initially 10 "cpus" was reduced to 5, then
    subsequently increased to 6, the system had inaccurately recomputed the
    subjob count to be 10.

    ==== CL 9058 ====
    @FIX: renaming logs during rotation would fail on Windows

    ==== CL 9037 ====
    @FIX: rename the globalResource_fact table to be all lower-case; the mixed-case name causes issues with stored procedure PFX_CREATE_DATASUBSET_TABLE(), which errors out with "ERROR 1050 (42S01) at line 1: Table 'globalresource_fact_12h' already exists" (note lower-cased name)

    ==== CL 9016 ====
    @NEW: adding license agreement for 3rd-party software
    @NEW: also adding our own License.rtf to the docs dir.

    ==== CL 9013 ====
    @NEW: added description of supervisor_job_flags in the qb.conf.template file

    ==== CL 9010 ====
    @FIX: fixed memory bloat issue in supervisor threads on start up, on farms with many jobs.
    In some cases, it had been reported that each supe thread was taking up 500+ MB.

    ==== CL 8939 ====
    @FIX: fixed another small "hole" that could cause race-conditions to dispatch a single subjob more than once

    ZD: 4783
    BUGZID: 63657

    ==== CL 8937 ====
    @FIX: supe issue where the same subjob can be dispatched more than once to worker(s).

    ZD: 4783
    BUGZID: 63657



