Tips of the Day

I express my heartfelt gratitude to the mentors of my research career, Prof. Philipp Kraehenbuehl, Dr. Jiecao Chen, Prof. Alexander Schwing, Prof. Yu-Xiong Wang, Dr. Jie Yan, Dr. Chuan Luo, Prof. Changliu Liu and Prof. Zongqing Lu (sorted by time), as well as all the wonderous people I have met at Apple, ByteDance, UIUC, MSRA, CMU and PKU. I learned all these tips from my experiences working with you.

"The palest ink is better than the strongest memory."

Vanilla A2C/PPO without reward shaping/prolonged episode/ exploration skills are actually hard to deal with mountain car, as the reward is too sparse.
It is important to do state/reward normalization for PPO to maintain numerical stability.
DO NOT put any function that changes global variables in Pycharm’s watch! (e.g. a function in Pycharm’s watch which adds a global counter by 1 may cause the wrong value of the counter).
Gurobi, Mosek and SCIP (not scipy!) are the best open-source optimization solver. CVXPY integrated many of them.
Don’t use scipy in your research project as an optimization solver; use Gurobi instead. An academic license costs $0, yet Gurobi is ~250x faster than scipy (and also more numerically stable).
Normally, a good solver (e.g. Gurobi) will do some numerical tricks for actions that may cause singularity.
If you don’t know what hyper-parameter to set, go find a previous work and inherit their params. This will help to convince the reviewers that your idea works.
Randomly initialized NN has a compressing effect (see Benjamin/Rechat’s work), which means its output is probably a contract mapping (with possible shifts) with random inputs. This effect can be used in anomaly detection.
When dealing with a temporal sequence, use the first part (e.g. year 1-6 for a 10-year dataset) as the training set, then validation set, finally the test set.
Prediction models for time sequence (e.g. electricity/VM demand) usually underestimates, for there are systematic bias (e.g. peaks) in the dataset. On the other hand, underestimating the demands are usually more serious than overestimating in real life.
You can get Azure VM information from Kusto.
Exploration for RL matters, even for toy environment. The same environment with different default behavior for illegal actions (e.g. stay or randomly moving or giving large negative reward) causes huge performance gap for A2C. As for my own experience, the first two are better choices.
L1 loss fits better for data with sparse entries, and is more robust against outliers.
The goal of experimental parts in a paper is not stating “what we’ve done”. It should be organized by “What we’re going to validate” (e.g. Why do we design this experiment, and what is the conclusion).
The MIT book and the Boyd book are the two classical textbooks for convex optimization; strongly recommending the two books.
The difference of \forall and for x \in X: The former emphasizes “satisfaction of conditions”, usually used in proofs of advanced mathematics; the latter is an enumeration. They are generally the same, but proper usage helps comprehension for readers.
A sparse embedding (e.g. holiday tag) with small training set is inherently infavorable over two-stage method and favors decision-focused method.
Write papers! Only by writing papers can you be more rigorous in language for papers.
Constraint is for decision variables’ feasible domain. The relationship between problem parameters should not appear in the constraint part.
tensor([0]) < 0.5 is False. Note the round down of integer types of torch.tensor!
To check the difference of two general distributions (e.g. When you are comparing the performance of two methods), mean and std are not enough. Try percentile, histograms and Maximum Mean Discrepancy!
Add axis label and title for debugging figures, as you may forget what you were plotting.
Do periodically save your code and model for an actively debugged program; preferably automatically doing so every time you run your code.
A L1/L2 regularization is by essence Lipschitz regularization for target function.
Some ways to note current update for your research field: a) arxiv subscribing cs.AI cs.LG, plus manually searching the key word *proceedings of ICML、NeurIPS、ICLR、UAI、AISTATS, etc, and b) reddit.com/r/MachineLearning
Put a demo one-line run script for cmd/shell in your project readme. The most common one will do.
Do note your notations for theoretical parts, and try your best to make it coherent for each of the theorem / both main paper and appendix.
Recurrent DDPG is unreliable and hard to tune. MADDPG/Recurrent MADDPG is even more painful. So do recurrent TD3; try to avoid recurrent policy if you want stable performance.
Programming dataset, e.g. GAMS, has a very large number of dimensions for decisions (e.g. >100k).
A noise of ~0.05 over a value 1 causes a SNR less than 15db, and by this aspect is not a small noise.
If you can tell a good story / establish a good framework, then the experimental part will be much easier as it only serves as a validation. Otherwise, your research will be an empirical one, which requires high demand on performance.
General Multi-Objective problem may seem luring, but it is not trivial: pareto optimal means balance over multiple goals, yet such goals usually depends on the settings of real scenario.
“Add noise then discretization(e.g. rounding)” is more close to reality than “discretization then add noise”.
Sometimes, if the experiment code is not working, you can fix some elements to debug. E.g. for off-policy 2-step RL, you can fix the first step and try to train the 2nd step; if the current picture training set is not working, you can pick one picture as the training set to see if it can overfit; if not, the code may be buggy. However, such practice (the one datapoint method) may face the problem of not having enough support for optimization surface, so it is not a panecea.
Intuitively, the following situation will put decision-focused method at advantage over 2-stage method: a) the optimization part, with surrogate, has a differentiable argmax and good generalization, and b) the prediction part has some outlier dimensions which has low weight on optimization quality.
If you find an unnecessary condition set in your experiment due to early decisions, If you have no time for re-runs, you can simply explain the condition in the appendix, and give a real-life example if necessary.
For a multi-dimensional decision vector in optimization, the influence of single/minority number of dimension may be overwhelmed.
2-stage early stopping has an inherent logic of “doing prediction well first”. Thus, it should be early stopping according to prediction loss instead of optimization performance.
Significance tests are usually conducted in traditional statistic works for hypotheses, especially where test set does not exist.
Use on-policy methods for MARL, as stationarity is not preserved!
When you are imitating someone else’s code but failed, a useful debugging method is to take his code, and changing his code into yours function by function (instead of changing yours onto his). You can try the differnet versions of code in parallel to quicker iterate.
Batchnorm is influenced by eval/train! By default, the running_stats is on. Then for training, the normalization is conducted with batch statistics; but for evaluation, the normalization is conducted with a fixed mean and variance estimate kept with a momentum of 0.1. This could have a VERY BIG influence if you ignore the difference.
You can try to feed feature^2 besides feature into MLP to get better expressivity, which works particularly well in fitting near-quadratic functions.
torch implementations such as logsumexp are numerically stable, and should be used instead of self-implemented vanilla code.
Be patient when you are training a large network. For a classifier, the training loss may be not decreasing in a relatively long period at the beginning of the training (although the output is changing greatly), but the loss will decrease quicker in the later training process.
One technique for serious outliers in a dataset is to clip the loss to a constant, e.g. minimize max(-log(y|x), 0.1); this effectively “rejects” the gradient from the outliers and upper bounds the loss.
Note: Pytorch passes address, so if you want to only pass value to a function, make sure that you use clone() function! (e.g. for normalizing flows)
Do not trust “manual design” too much against randomization in deep learning. (e.g. permutations of channels in normalizing flows)
Note that torch.KLDivLoss(q.log(), p) = KL(p||q).
When you are tuning performance, try keep observing the curve for the first run if possible; this takes a little time, but it helps you to grab a sense of what is happening, and what epoch is the best. Also, try to run your code through before starting a long experiment (e.g. set epoch to 1 to see if the model can save correctly).
Use the following code to fix your pytorch random seeds, preferably at the beginning of main process:

    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed) # when using multiple GPUs
    torch.cuda.manual_seed(seed)     
    np.random.seed(seed)
    random.seed(seed)
    torch.backends.cudnn.deterministic = True 
    # torch.use_deterministic_algorithms(True) use with caution; this line of code changes many behavior of program. 
    torch.backends.cudnn.benchmark = False # CUDNN will try different methods and use an optimal one if this is set to true. This could be harmful if your input size / architecture is changing.

And don’t forget to use env.seed(seed) for your gym environment!

Note: once the random seed is set anywhere in this process (regardless of which file it is in), the seed remain fixed (unless implicitly set by other libraries).

You should reduce the learning rate if you are using batchnorm. Batchnorm changes the landscape.
What is the difference between optimizing KL and reverse KL? Mode-seeking (reverse) and mode-covering (forward)! See https://www.tuananhle.co.uk/notes/reverse-forward-kl.html for a brief explanation.
You can use the following code to visualize your RL episode in your gym step:

    img = env.render(mode='rgb_array')
    IMG.append(img)

and at the end of the episode, you write

 imageio.mimsave(name+'.mp4', IMG, fps=25)

If you need to change distribution in expectation in your derivation, try importance sampling. But as this introduces a possibly instable denominator, you may need surrogates to stabilize the whole thing.
If you are encountering strange problems in python import (e.g. missing .so), there are a few possible things that you can do:

1) check if the import path is wrong. For example, is your python importing a package from /.local/lib/python3.x/site-packages instead of your conda directory?

2) check both pip and conda envrionment, especially where there is a version mismatch between python xxx.__version__ and pip list / conda list. https://www.anaconda.com/blog/using-pip-in-a-conda-environment Actually, you may want to avoid using pip and conda together.

3) check your python, linux, and some system library version (e.g. MPI) under different settings. Sometimes the problem comes from version mismatch.

4) DO NOT try to copy-paste a file from another version into this version and simply change its name for a quick fix, unless you absolutely know what you are doing. While sometimes it can fix the problem, this creates an environment that is hard to reproduce and would be questioned by future readers of your papers.

There are many people write ML code, but many of them do not write the code well. Make sure you have seen others’ code and wisely referred to them for your own code; you should not easily believe in one single reference, even if it has many stars.
Whenever you see a logsumexp() in your formulation, ask yourself if you have made it robust by subtracting the largest term. (torch.logsumexp already does this for you.)
When using remote Python debugger with pycharm professional, you can close “attach to subprocess” option if you witness strange bugs. (At settings -> build, execution, deployment -> Python debugger)
When developing a new architecture for deep learning, you cannot simply considering throwing a dart randomly and hoping it can work. You should deviate from original design gradually, that is, stand on the shoulders of the giants.
You should never use gradient clipping along with weight decay. The gradient clipping take effect Before weight decay, thus greatly amplifying the weight decay factor and cause the training to be weird.
You should not copy models by iterating through parameters(), as parameters() does not contain parameters from the batchnorm layers. Use copy.deepcopy() or load_state_dict() instead.
If you are confronting strange problems in installing a package with “pip install -e .”, take a look at the setup.py, especially if you cannot find any version. Sometimes, authors will use “git+git@xxxx” to fetch dependency posts. you should change it to git+https://xxxx … as you are not collaborator / author of the repo.
If you cannot run “git submodule update –init –recursive”, you should check .gitmodule file to see if there is problem, especially that mentioned in 63. After that, “run git submodule sync” and the problem should be fixed.
Even the smallest library/package version change can cause a big difference in a particular environment. For example, mpi4py 3.1.1 and 3.1.3, though seemingly no big difference in the update log, can decide whether a program is runnable or not.
Different version of GPU (e.g. Nvidia A6000 and RTX 2080Ti) and different computation platform (e.g. GPU and CPU) could lead to non-egligible value difference when doing matrix multiplication! See https://forums.developer.nvidia.com/t/cpu-and-gpu-floating-point-calculations-results-are-different/18175 for details.
If we decrease the number of steps in the diffusion model, for each sampled diffusion timestep t, on average, the product of \alpha, which is \bar{\alpha} will increase as there are less terms less than 1. As we are fitting epsilon, this leads to lower signal-noise-ratio for epsilon and higher MSEloss. Therefore, fewer number of steps requires higher beta. (https://arxiv.org/pdf/2006.11239.pdf)
Remember to save your powerpoint every time you made a slide, and add a timestamp to your experiment results.
You can use the following code to output all attributes in args from argparse:

for arg in vars(args): f.write(str(arg)+" "+str(getattr(args, arg))+"\n")

Use the same color for the same method throughout the paper of your work, and same notation for the same thing as well.
Keep the citations nice, neat and simple for your papers. Just keep the platform of publication (e.g. ICML, Nature), paper name, year and author on it and don’t put the pages / publishers etc.
Use grammarly to check the writing of your paper, but do not overly rely on it. It may not recognize terms in your field and make mistakes.
You can use a plain notation in Python for elementwise multiplication (Hadamard product), but need to state elementwise clearly when writing papers.
Parenthesis around sum symbols should be at least as large as the sum symbols.
Beware of any presence of your identification in the code of paper, including absolute path and platform username (e.g. for wandb)!
Figures and its captions should be self-contained, especially in appendix where the space is unlimited; put settings and brief conclusion there.
The most important hyperparam for PPO is #update epoch and update interval (# env steps).
Boil down your slides (in mind); nobody will parse a slide that is full of text.
BC is a good baseline if a complete trajectory presents, and if the initial position is of small variance. On discrete MDP, the best BC result is simply counting transitions and do random action if current state is never witnessed.
Continuing from 79, you should be pessimistic in offline RL / IL, so that you policy does not astray from what you have witnessed.
Wasserstein distance is a much “weaker” distance than f-divergences (e.g. KL divergence), which means in many scenario, the f-divergence method will give either a infinite value / being invalid, or being uncontinuous, or losing a gradient. Intuitively, this is because Wasserstein distance represents a much “weaker” norm in topology (see WGAN paper for details). Wasserstein-1 distance is also called earth mover’s distance.
If you have a score model which estimates the gradient of log probability of some unknown distribution, you can sample from the distribution using Langevin dynamics. This is score matching method.
The notion of spaces:

A normed space is a special metric space, which means elements have the notion of “large / small” by norm. A complete normed space is called a Banach space; by “complete” it means the limit of a Cauchy sequence is still in the space (counterexample: rational number Q)

A Euclidean space is a finite-dimenisional linear space with an inner product. A Hilbert space is an expansion of Euclidean space; it means a complete inner product space, but can be infinite dimensional and not confined to real numbers.

By Mercer theorem, any semi-positive definite function can be a kernel function.
You should not put plotting plt and ax inside an object and make it a property of other object, especially one that is not a singleton (e.g. solver class for a ML solution). Remember that plt settings are global; object duplication will ruin your plotting.
Be wary of the subtle constraints on Lagrange multipliers when you try to derive the dual problem (without them the optimal value could be unbounded; e.g. when you derive dual for linear programming). You should be extra careful when you only apply Lagrange on part of the constraints; when the other part of constraints is a bounded closed set, the problem might be much harder to discover.
Be very careful when you try to explain something with a toy example but change cases (e.g. when talking about something for continuous space, use discrete space as an example).
In a rebuttal, write in a way that is considerate for the reviewers:

1) Answer the question clearly with a few words at the beginning of the problem;

2) Do not show them that you are lazy typing, but you are helping them (e.g. for brevity -> for readability);

3) If possible, do not force the reviewer to get back to the paper. List the points briefly besides reference to the paper. Similarly, avoid reference to the other reviewers;

4) Do not simply write “We will change this”, but show them “how we will change this” (and in the case where you can update pdf, do it) and invite them for advice on further modification;

5) Reply after everything is ready, but immediately beyond that point.

Do not assume that the battle is over until the authors are not expected to say anything (e.g. reviewer-metareviewer discussion). For NeurIPS, finishing rebuttal period is only half-way there; there are still much work to do at author-reviewer discussion period.
For L-BFGS, increasing the history hyperparam will increase the stability of iteration. Also, you should use line search with ‘strong wolfe’ in pytorch.

L-BFGS needs optimizer.step(closure()) where closure() gives the loss function. It might be invoked multiple times in one timestep, sometimes with gradient and sometimes without. That’s why you will sometimes get backward second time error if you do not put everything in the closure() function. Here are two examples: https://gist.github.com/tuelwer/0b52817e9b6251d940fd8e2921ec5e20#file-pytorch-lbfgs-example-py-L27; http://sagecal.sourceforge.net/pytorch/index.html.

Be very careful when you try to generate dataset with some tricks (e.g. manipulate the distribution so that the states are guaranteed to be covered) and handling the “default value” for corner case. They might lead to very counter-intuitive behavior if not considered properly.
Gurobi max (gp.max_) operator can only take constant and variable (that is, no expressions such as x+1) as of Sept. 2022.
torch.nn.parameter.Parameter(a.double(), requires_grad=True) is correct; torch.nn.parameter.Parameter(a, requires_grad=True).double() is not.
Wasserstein-1 distance with hamming distance metric is total variation distance.
If you are doing stochastic gradient descent by sampling some pairs of variables (e.g. uniformly sampling (i,j) for x_i+x_j), you’d better sample each pair independently, instead of sampling two state uniformly and then select all pairs of chosen states. In the latter case, you cannot break the correlation between (i,j) and (i,*), (i,j) and (j, *), as they are always updated together.
While there is a rule of thumb that choices the learning rate, it really depends on your scale of the loss and batch size. Be open to rare learning rates when you finetune your algorithm.
Remember to “git add” your new file when you are using git to do version control, especially writing script to auto-commit things. otherwise, you may find that your modifications are all untracked.
You can use gym.spaces.MultiBinary(n=10) for one-hot observation space.
torch.multinomial and torch.Categorical supports batch sampling, which means you only need to get a input of batchsize * n tensor and it will sample batchsize different groups of samples for you. You don’t have to go over the whole array! And use F.one_hot(m, num_classes=n) if neccessary.
Some points on making slides:

1) Text size should be unified across all slides and inside figure. Don’t be lazy and just using existing figures; draw a nice figure that expands as your presentation progresses. And change text instead of font size to fit in spaces.

2) The shorter the presentation is, the more rigorous logic your slides must have because you don’t have too much time to do “overviews” to remind people what you are discussing.

3) You can align the elements on your slides by choosing items and selecting “align”.

4) You can make changing text w.r.t. time by animation with time delay. No need to make a video using video edit tools.

5) Use animations to avoid making your slides to be overwhelming. Let the slide be filled gradually as your speak progresses.

6) Always introduce math symbols first before using them, even the most common-sense ones in the subfield (e.g. state s in reinforcement learning). You should use as less symbols as possible in a short presentation.

7) Colors and shapes matter; they can be a strong indicator in the figure. Ask yourself why use this color for this color / this shape?

self-made dataloader based on torch.randperm could be much faster than torch dataloader, especially if the data is stored in dict for each dataset. torch dataloader need to concatenate them every time and that can be very slow.
If you are trying to overfit behavior cloning on a small dataset to debug, remember to add variance lower bound (e.g. clip / tanh) to avoid spikes.
If you are training an action distribution on a closed set (e.g. in behavior cloning in gym environment), and you are using Gaussian / GMM / normalizing flow. One thing you could try to optimize log probability a lot is to use tanh to converge your output into a bounded one. And probability tractable will still be tractable.
Wasserstein distance in the Rubinstein-Kantorovich form assumes the underlying metric to be Euclidean, unless the definition of 1-Lipschitz is modified.
The sample complexity of Wasserstein distance is bad, but for MMD it is good. Sinkhorn Divergence stands between them, and have a corresponding sample complexity. They are all called integral probability methods.
Do not undo commit in github desktop unless you are absolutely certain! Undoing commit makes you lose all the progresses during this commit.
You need to use clf() instead of cla() to remove the old colorbar in your last figure in matplotlib. However, after that you need ax = fig.add_subplot() to re-insert subfigures in order to draw anything more on the canvas.
If you feel lost about why your method is not working while the baseline is, a way out is to implement your method inside the codebase of the baseline. In that way, you can make your method to be as similar to the baseline as possible, and to rule out the factors that does not matter one by one.
Be bold and aggressive when you first try to tune your algorithm; often it takes longer than expected to train / bolder choice of hyperparameter than your expecation to make your algorithm work.
Do read the experiment details of your baselines, and make sure of how they set up their experiment, especially what do they do to their dataset (e.g. merging). You do not want to waste time on settings that is unnecessarily harder / easier than prior work.
When you don’t know where is the problem of your algorithm, go and check if your dataset has problems.
If you are working optimizations of f-divergences on a probability simplex, consider Fenchel conjugate; consider Donsker-Varadhan representation and https://people.lids.mit.edu/yp/homepage/data/LN_fdiv.pdf Thm 7.14.
Continuing from 112: when considering relaxing optimization (e.g. use Lagrange multiplier to relax some constraints), relax as less constraint as possible (as long as you can solve it). Relax to probability simplex is better than relax to positivity constraint.
remember to set CUDA_LAUNCH_BLOCKING=1 whenever you meet a device-side assert triggered error.
If you don’t know what parameter to tune, try to do the following two things: 1) check very closely on your direct baseline to see how they solve the problem; 2) retry factors excluded before last bug fix. Sometimes bug fixes will make factors behave very differently and you may overlook some crucial factors.
For RL evaluation, you should try to use deterministic action (mean as output) as stochastic ones are often with fairly high variance and cannot do well, especially in those environments requiring accurate actions.
If you need to send your computer to repair, make sure you have copied everything you need out of it. Especially the private keys for the server.
If you need to copy datasets to different folders on your server, consider soft links; this saves your disk space and frees you from copying everytime you change your dataset.
If you were to build up a desktop, remember that do not throw the boxes until you have lighten up the machine. There might be some important information or some material (e.g. screws, cables) in the boxes.
When building up your desktop, remember to observe the minimal principle: use only as least as possible components to light up your mainboard first. Do not haste to install the extra memory / disk / graphics card. However, you should always make room for your GPU at the very beginning.
Make sure to check the debugging light and code on your mainboard to figure out the problem.
When swapping an element in an array and its index in python, be very careful: a[a[0]], a[0] = a[0], a[a[0]] might not behave the expected way. A better choice is to use a, b = copy.deepcopy(b), copy.deepcopy(a), or use the tmp variable.
(Pytorch official hint) If you need to move a model to GPU via .cuda() , please do so before constructing optimizers for it. Parameters of a model after .cuda() will be different objects with those before the call. In general, you should make sure that optimized parameters live in consistent locations when optimizers are constructed and used.
GLFW might be problematic on headless machines (i.e. servers); to fix this, try set MUJOCO_GL environment variable to egl or osmesa.
When you are using np.linalg.norm, make sure you know you are operating on a vector or matrix; they have very different behavior.
If you want to make modifications to a decision transformer, make sure your sequence is still one-way logical (always induce the terms on the right from the terms on the left).
The problem of beam search is that top-K choices may only include a very small portion of probability. You need to set K to be very large to include most possibility.
If torch.save cannot save because of errors like some local variables in class, try cloudpickle.dump(). example: cloudpickle.dump(agent, open(“model/”+NAME+”/agent-iter”+str(i)+”.pkl”, mode=”wb”))
Remember that decision transformer is a transformer, so as long as the timestep, state and attention mask are correctly matched and in the correct order, it does not matter where you put the paddings (front or latter). Remember, decision transformer can be used for unordered set once the positional encoding is removed, so it does not really matter if it is front-padded or latter-padded.
GPT2 in huggingface is causal, and decision transformer is based on GPT2. It does not matter what padding action you are putting into the place where attention mask is 0.
Remember that AWR’s implementation has a weight clipping of 20, and it is normal that initially there are either weight 20 or 0. Also, AWR is quite sensitive to buffer size; buffer size too small (50K recommended) will make the algorithm overfit to the dataset.

132.

import time
from tqdm import tqdm
lst, lst2, lst3, lst4 = [], [], [], []

for i in tqdm(range(100000)):
    lst.append(torch.zeros(1, 100))
    lst2.append(torch.zeros(100))

t0 = time.time()
for i in tqdm(range(100000)):
    lst3.append(lst[i][[0], :])
t1 = time.time()

for i in tqdm(range(100000)):
    lst4.append(lst2[i].reshape(1, 100))
t2 = time.time()

print("[0]:", t1 - t0, "reshape:", t2-t1)

The result is [0]: 1.9272778034210205 reshape: 0.3856058120727539. The latter is 5x faster than the former! (torch.stack is even faster!)

When you find that the training curve is strange, make sure to check whether you sampling process is fine; your program might only be trained on a small subset due to code bug.
remember torch.distributions.Normal takes standard deviation as input, but torch.distributions.multivariate_normal.MultivariateNormal takes variance (covariance) as input!
remember to check the original code by the author; there might be some special tricks in it or different hyperparams that are not specified in the paper.
Be very careful when you use xxx if yyy else zzz; adding () at the edge of the expressions is always a good practice.
A good way to write squashed Gaussian is through torch.distributions (from online DT):

class TanhTransform(pyd.transforms.Transform):
    domain = pyd.constraints.real
    codomain = pyd.constraints.interval(-1.0, 1.0)
    bijective = True
    sign = +1

    def __init__(self, cache_size=1):
        super().__init__(cache_size=cache_size)

    @staticmethod
    def atanh(x):
        return 0.5 * (x.log1p() - (-x).log1p())

    def __eq__(self, other):
        return isinstance(other, TanhTransform)

    def _call(self, x):
        return x.tanh()

    def _inverse(self, y):
        # We do not clamp to the boundary here as it may degrade the performance of certain algorithms.
        # one should use `cache_size=1` instead
        return self.atanh(y)

    def log_abs_det_jacobian(self, x, y):
        # We use a formula that is more numerically stable, see details in the following link
        # https://github.com/tensorflow/probability/commit/ef6bb176e0ebd1cf6e25c6b5cecdd2428c22963f#diff-e120f70e92e6741bca649f04fcd907b7
        return 2.0 * (math.log(2.0) - x - F.softplus(-2.0 * x))


class SquashedNormal(pyd.transformed_distribution.TransformedDistribution):
    """
    Squashed Normal Distribution(s)

    If loc/std is of size (batch_size, sequence length, d),
    this returns batch_size * sequence length * d
    independent squashed univariate normal distributions.
    """

    def __init__(self, loc, std):
        self.loc = loc
        self.std = std
        
        # print("shape:", loc.shape, std.shape)
        
        self.base_dist = pyd.Normal(loc, std)

        transforms = [TanhTransform()]#  [] #
        super().__init__(self.base_dist, transforms)

    @property
    def mean(self):
        mu = self.loc
        for tr in self.transforms:
            mu = tr(mu)
        return mu

    def entropy(self, N=1):
        # sample from the distribution and then compute
        # the empirical entropy:
        x = self.rsample((N,))
        log_p = self.log_prob(x)

        # log_p: (batch_size, context_len, action_dim),
        return -log_p.mean(axis=0).sum(axis=2)

    def log_likelihood(self, x):
        # log_prob(x): (batch_size, context_len, action_dim)
        # sum up along the action dimensions
        # Return tensor shape: (batch_size, context_len)
        return self.log_prob(x).sum(axis=2)

A simple way to write a dataset for iteration:

class RepeatedDataset:
    def __init__(self, datas, batch_size, start_with_random=True):
        self.datas = []
        for data in datas: # list of arrays with the same first dimension.
            self.datas.append(data.clone())
        self.counter, self.idx, self.batch_size = 0, torch.randperm(self.datas[0].shape[0]), batch_size
        if start_with_random:
            for _ in range(len(self.datas)):
                print("shape:", self.datas[_].shape)
                self.datas[_] = self.datas[_][self.idx]
    
    def __len__(self):
        return self.datas[0].shape[0] // self.batch_size    
    
    def getitem(self):
        if self.counter + self.batch_size > len(self.idx):
            self.counter, self.idx = 0, torch.randperm(self.datas[0].shape[0])
            for _ in range(len(self.datas)):
                self.datas[_] = self.datas[_][self.idx]
        ret = []
        for _ in range(len(self.datas)):
            ret.append(self.datas[_][self.counter:self.counter+self.batch_size])
        self.counter += self.batch_size
        """
        print(self.counter, self.counter+self.batch_size)
        
        for _ in range(len(self.datas)):
            print(self.datas[_][self.counter:self.counter+self.batch_size])
        """
        if len(self.datas) == 1: return ret[0]
        else: return ret

You should not use multiprocessing in dataloader while loading tensors, or you might get a CUDA initialization error. Make sure that your torch.dataloader only loads numpy arrays instead of tensors. (besides, if the data is on GPU, why bother loading it via multiprocessing?)
use dataload = iter(dataloader); next(dataload) to get the next batch in the torch dataloader. (do not use next(iter()) as it is very slow!)
You must do left-padding for pretrained LLM models, because LLMs are decoder-only architectures and are not trained to continue from padding tokens! (https://huggingface.co/docs/transformers/main/en/llm_tutorial#wrong-padding-side)
If your self-implemented SAC algorithm is diverging, you should check whether the entropy sign is correct. If the entropy term is wrong, then the Q value will certainly diverge (which is different from other cases where the entropy is not involved in the TD function).
next(it), it=iter(dataloader) is very slow, probably because it does not use the parallelization of the torch dataloader; try iterate in a for loop instead.
If you find that the CPU usage of pytorch code is very high, try use torch.set_num_threads(1) (to reduce thread communication cost) or pin_memory=False (if you have ever explicitly set it to true).
When making slides, the front “dot” recommendation: unicode 2022 (in custom), 100% height
Be very careful when you adapt random agents to deterministic algorithms (e.g. TD3 to ODT). You probably run the risk of not initiating exploration noise, which does not have to exist when it was a stochastic agent.
Be close to standard D4RL format; it is better that your program directly read from get_dataset() such that you have better reproducibility.
Transformer RL agents might have very different hyperparameters from MLP ones (e.g. critic learning rate).
If you are confronting weird critic divergence, check your data; if not a single state is “terminal” (i.e. all timeout), remember to set one to terminal.
If your RL agent is diverging due to strange reasons, try layernorm on the critic. However, adding layernorm to the critic is not always the best choice; sometimes (e.g. mujoco) it slows down the learning process, but sometimes (e.g. adroit) it is magical.
If you are wondering how people solve antmaze: they (CQL, IQL) sub reward by 1, making a sparse reward env becoming a dense one.
Make sure that you use \left and \right before the parentheses for complicated contents in the formula (e.g. \exp\left(\frac{a}{b}\right) ).
Remember that “by [4]” is not correct in writing papers; instead, you should write “by xxx et al. [4]”.
Remember to use \eqref instead of \ref for equations.
Remember to use vector graph (i.e., pdf) for figures in the paper.
When you are updating posts in Jekyll, make sure that you add posts from a past point of time. The future posts will be skipped by Jekyll. To check this, use jekyll build –verbose.
Remember that openreview requires “_” for latex formulas that are “_” in overleaf. https://docs.openreview.net/reference/openreview-tex/common-issues-with-latex-code-display
Check for the success file when you use hdfs to download items.
You should be very careful when you are building a benchmark about data quality.
If you are encountering problems such as “ValueError: Unable to create tensor, you should probably activate truncation and/or padding with ‘padding=True’ ‘truncation=True’ to have batched tensors with the same length. Perhaps your features (labels in this case) have excessive nesting (inputs type list where type int is expected).” but you are sure you processed it, check if it is the problem of your new dataset containing previous items that has not been processed. https://github.com/huggingface/transformers/issues/15505
How to set up a language model for regression with transformer library: https://discuss.huggingface.co/t/how-to-set-up-trainer-for-a-regression/12994 (i.e. set up a single label for classification and set loss to be MSE).
BitsandBytes seem ot handle the problem of bfloat16 vs. float by its own. When you are not using it to finetune LLMs, be very careful about the dtype of the parameters vs. data; there will be subtle errors. (In other words, you should check this if you met subtle errors in finetuning LLMs).
Remember that most BERT’s context length is only 512 tokens!
When you are using dmc2gym to finetune image-based RL environments on headless machine, you might meet the problem of X11 host error by glfw. Remember to set rendering with osmesa / egl!
Remember to do pixel normalization (0 to 255 -> -1 to 1) when you are doing visual RL!
Process reward models are not as beautiful as it seems, though it might be attractive for RL people - read deepseek R1 paper for details.
Do not inject too many human-made rules in the early stages of LLM training pipeline (e.g. careful selecting data for pretraining), which might limit the final upper bound of the model.
If you are using sglang or vllm, try to set –mem-fraction-static (sglang) / –gpu-memory-utilization (vllm) to 0.8 as a good starting point for both. You should not set this value too small for sglang (you can do this for vllm though it will be inefficient). Be sure to check whether there are zombie processes taking your GPU memories!
This is wrong: df[‘input’] = str(df[‘input_output’].apply(lambda x: json.loads(x)[‘inputs’])); instead, you should use f[‘input_output’].apply(lambda x: json.loads(x)[‘outputs’]).apply(lambda x: str(x)).
Be sure to first backup any file that is possibly being written by the current running scripts when you are killing the script. You don’t want your data to be corrupted!
Be very careful when you are consistently writing logs in a github repo; if you keep writing it without configuring gitignore properly, you will see the running becomes slower and slower because of the github lfs tracker in the background.
When trained on a particular task, even 7b small models can exhibit quite impressive performance. If you feel your model has low reward during RL on the task (even if it is fairly complicated), go and check your models.
The best way to debug LLM training is to look at the output it produce.
Be sure to check your config if you are using gradient checkpointing first when you got a GPU OOM error during training!
Be very careful when you use numbers for the start of the name of a folder; be sure to check if there is any exec() that uses the folder name as a function (e.g. Berkeley function call leaderboard). Similarly, be very careful on the choice of “_” vs. “-“.
Pretrain determines the upper bound of a LLM, and RL makes the best out of the model. Do not apply too much man-made constraints in training data, especially in early training stages of LLM.
There is some subtle bug with deepspeed zero with Qwen that causes nan. You might want to use FSDP instead.
Always remember to check temperature if you are training LLM and find that its generalized result does not make sense.

Add this to the end of ~/.bash_profile if you have trouble in tmux finding conda after installing it:

# If running bash
if [ -n "$BASH_VERSION" ]; then
    # Include .bashrc if it exists
    if [ -f "$HOME/.bashrc" ]; then
   . "$HOME/.bashrc"
    fi
fi 

There might be a generalization gap between rendered pictures and real-life pictures by VLM.
A good way to debug whether the output discrepancy (at temperature 0) is caused by bug or hardware: check logits and raw input token / pixel values; try to curate a very small, homogenous dataset and look at the output. Oftentimes even with temperature=0, you cannot replicate everything due to GPU / precision / different package issues / using liger. But this usually just case 1-2% performance difference. Also, if you want to evaluate your model out of the transformer training loop, be careful if model.eval() is missing.
It is always a good idea to double-check whether the distribution and aggregation of your dataset is correct especially when you are evaluating on multiple GPUs. A good way to do this is to output the total count of items.
DO NOT train a model with flashattention but load and evaluate it and use regular SDPA attention. This could cause very large performance difference!
How to aggregate result when using torch.distributed run:

torch.distributed.barrier() # can also use "with barrier_guard(before=True, after=True):""
local_rollouts = self.collect_rollouts(iteration, is_evaluation=True)
torch.distributed.barrier()
all_rollouts = [None] * torch.distributed.get_world_size()
torch.distributed.all_gather_object(all_rollouts, local_rollouts)
if rank_zero_only(self.rank):
    all_rollouts = list(itertools.chain.from_iterable(all_rollouts))
    self.log_metrics(all_rollouts, is_ood_eval=True)

How to debug multi-GPU LLM job:

Follow point 181 and 182.
check if the temperature is set to 0 / do_sample is False. there could be minor difference even with these, but the performance should not be wildly different.
check point 183.
check model precision and quantization: are you using float32 or bfloat16?
check input sizes: is it too large? (e.g. too large image for vlm)?
output input token and output logits for a single data for comparison
Be very careful to check whether your result is properly aggregated across GPUs instead of showing result only from one GPU.
check padding length. This should not lead to that different performance but still worth checking.
check if your model.eval() is on.
check environment version (e.g. version of python and transformer).
A 1-2% performance difference is OK; a 5% performance difference is alarming.

Your git merge only works with local target branch. Make sure you have git pulled before you merge.
How to build your custom evaluation over HF evaluate(): (+deepspeed zero2)

class QwenSFTTrainer(Trainer):
    """
    def training_step(self, model, inputs, num_items=None):
        # Run standard training step
        model.train()
        inputs = self._prepare_inputs(inputs)

        with self.compute_loss_context_manager():
            loss = self.compute_loss(model, inputs)

        if self.args.gradient_accumulation_steps > 1:
            loss = loss / self.args.gradient_accumulation_steps

        loss.backward()

        # Log gradient stats
        max_grad = 0.0
        has_nan = False
        has_inf = False

        for name, param in model.named_parameters():
            if param.grad is not None:
                grad = param.grad
                if torch.isnan(grad).any():
                    print(f"[WARNING] NaN in gradients of parameter: {name}")
                    has_nan = True
                if torch.isinf(grad).any():
                    print(f"[WARNING] Inf in gradients of parameter: {name}")
                    has_inf = True
                param_max = torch.max(torch.abs(grad)).item()
                if param_max > max_grad:
                    max_grad = param_max

        print(f"[Max Gradient Element] {max_grad:.6f} | NaN: {has_nan} | Inf: {has_inf}")

        return loss.detach()
    """
    def __init__(self, *args, **kwargs):
        self.image_path = kwargs.pop("image_path", None)
        super(QwenSFTTrainer, self).__init__(*args, **kwargs)
        #print("args:", args)
        #print("kwargs:", kwargs)
        training_args = kwargs.get("args", None)
        # print("training_Args:", training_args, "image_path:", self.image_path)
        self.experiment_name = getattr(training_args, "experiment_name", "trained")
        if training_args is not None:
            self.eval_reward_data_path = getattr(training_args, "eval_reward_data_path", None)
        else:
            self.eval_reward_data_path = None
        with open(self.eval_reward_data_path, "r") as f:
            self.eval_reward_data = json.load(f)
            # print(self.eval_reward_data_path, str(self.eval_reward_data[:100]))
            if training_args.override_system_prompt != 'no':
                with open(training_args.override_system_prompt, "r") as f:
                    self.system_prompt = f.read()
                    print("system prompt overrided!")
            else: self.system_prompt = self.eval_reward_data['system_prompt']
            self.eval_reward_data = self.eval_reward_data['data']# [:BS*N_GPU]
            # assert len(self.eval_reward_data) == 256, "Error!"
            keys = self.eval_reward_data[0].keys()
            self.eval_reward_data = {k: [str(self.eval_reward_data[i][k]) for i in range(len(self.eval_reward_data))] for k in ['ground_truth_location', 'answer', 'user_prompt', 'partial_reward_perceive', 'type', 'image_path', 'tag']}
            self.eval_reward_data = Dataset.from_dict(self.eval_reward_data)
        

    def evaluate(self, eval_dataset=None, ignore_keys=None, metric_key_prefix: str="eval"):
        metrics = super().evaluate(eval_dataset, ignore_keys=ignore_keys, metric_key_prefix=metric_key_prefix)
        
        # Create distributed sampler for proper data sharding across GPUs
        # Use the same pattern as HuggingFace Trainer's _get_eval_sampler
        if self.args.local_rank != -1:  # Distributed training
            sampler = torch.utils.data.DistributedSampler(
                self.eval_reward_data,
                num_replicas=self.args.world_size,
                rank=self.args.process_index if hasattr(self.args, 'process_index') else self.args.local_rank,
                shuffle=False,
            )
        else:
            sampler = torch.utils.data.SequentialSampler(self.eval_reward_data)
        
        custom_eval_dataloader = DataLoader(
            self.eval_reward_data,
            batch_size=BS,
            sampler=sampler,
            collate_fn=list_data_collator,
            drop_last=False,
            num_workers=self.args.dataloader_num_workers,
            pin_memory=self.args.dataloader_pin_memory,
        )
        
        # For DeepSpeed, we don't need to prepare the dataloader with accelerator
        # Just move it to the correct device if needed
        
        # Debug info
        num_batches_per_gpu = len(custom_eval_dataloader)
        if hasattr(sampler, '__len__'):
            num_samples_per_gpu = len(sampler)
        else:
            num_samples_per_gpu = len(self.eval_reward_data)
            
        if self.args.local_rank <= 0:  # Print only from main process
            print("Starting custom evaluation on all GPUs...")
        
        print(f"GPU {self.args.local_rank}: "
              f"Full dataset size = {len(self.eval_reward_data)}, "
              f"Samples on this GPU = {num_samples_per_gpu}, "
              f"Batches on this GPU = {num_batches_per_gpu}")
        
        # Run evaluation on ALL processes (each GPU processes its shard)
        custom_metrics = eval_reward_withCI(
            custom_eval_dataloader, 
            self.model, 
            self.tokenizer, 
            system_prompt=self.system_prompt, 
            global_step=self.state.global_step, 
            image_path=self.image_path, 
            name_prefix=self.experiment_name
        )
        
        # Add prefix to metrics
        custom_metrics_prefixed = {f"{metric_key_prefix}/{k}": v for k, v in custom_metrics.items()}
        
        # Convert metrics to tensors and aggregate across GPUs
        if self.args.local_rank != -1:  # Only aggregate in distributed setting
            aggregated_metrics = {}
            
            for k, v in custom_metrics_prefixed.items():
                if isinstance(v, (int, float)):
                    # Convert to tensor on the correct device
                    if hasattr(self.model, 'device'):
                        device = self.model.device
                    elif torch.cuda.is_available():
                        device = torch.device(f'cuda:{self.args.local_rank}')
                    else:
                        device = torch.device('cpu')
                        
                    tensor_val = torch.tensor(float(v), device=device, dtype=torch.float32)
                    
                    # All-reduce across all processes
                    if "count" in k.lower() or "total" in k.lower():
                        # Sum for count/total metrics
                        torch.distributed.all_reduce(tensor_val, op=torch.distributed.ReduceOp.SUM)
                    else:
                        # Average for other metrics
                        torch.distributed.all_reduce(tensor_val, op=torch.distributed.ReduceOp.SUM)
                        tensor_val /= self.args.world_size
                    
                    aggregated_metrics[k] = tensor_val.item()
                else:
                    # For non-numeric metrics, just use the value from rank 0
                    aggregated_metrics[k] = v
            
            # Synchronize all processes
            torch.distributed.barrier()
            
        else:
            # Single GPU case
            aggregated_metrics = custom_metrics_prefixed
        
        # Log only from the main process (rank 0)
        if self.args.local_rank <= 0:
            print("Aggregated custom metrics:", aggregated_metrics)
            self.log(aggregated_metrics)
        
        # Update the main metrics dictionary
        metrics.update(aggregated_metrics)
        
        return metrics

Be careful with single element and list of elements in datasets. For example, you have

allowed = set(allowed_list)
ds_sub = ds.filter(
    lambda batch: [v in allowed for v in batch["label"]],
    batched=True,
    num_proc=4,        # adjust to CPU cores
)

for num_proc>1 but

ds_sub = ds.filter(lambda ex: ex["label"] in allowed)

for single element.

Code for retrieving point cloud from image and mask: (remember to check whether the “default direction” is [0, 0, 1] or [0, 0, -1]!)

def run_custom(self, image_filename, image, masks, output_dir):
        import math
        import numpy as np
        import os
        import cv2
        from pathlib import Path
        import open3d as o3d

        scene_pcd, depth_map_np, focal_val, extrinsic_1, intrinsic_1 = self.create_point_cloud_from_model(image) # PIL image
        original_points = np.asarray(scene_pcd.points)
        #print("extrinsic:", extrinsic_1)
        R, t = extrinsic_1[0, :3, :3].cpu().numpy(), extrinsic_1[0, :3, 3].cpu().numpy()
        camera_position = -R.T @ t
        camera_lookat = (R.T @ np.array([[0.0], [0.0], [1.0]])).reshape(-1)
        K = intrinsic_1[0].cpu().numpy() # The 3x3 intrinsic matrix
        point_cloud_filepaths = []

        points = np.asarray(scene_pcd.points)   # shape (N, 3)
        colors = np.asarray(scene_pcd.colors)   # shape (N, 3)

        output_pointcloud_dir = os.path.join(output_dir, "pointclouds")
        Path(output_pointcloud_dir).mkdir(parents=True, exist_ok=True)


        W, H = image.size
        K_use = K.copy()
        
        Hm, Wm = depth_map_np.shape
        sx, sy = W / Wm, H / Hm
        K_use[0,0] *= sx          # fx
        K_use[1,1] *= sy          # fy
        K_use[0,2] *= sx          # cx
        K_use[1,2] *= sy          # cy
        
        
        print("==================")
        for idx, mask_img in enumerate(masks):
            mask_array = np.array(mask_img, dtype=np.uint8) # 255 * np.ones_like(mask_img, dtype=np.uint8) #
            print(f"image size: ({W}, {H}), mask array: {mask_array.shape}, depth map: {depth_map_np.shape}")
            if mask_array.ndim != 2: continue

            if (mask_array.ndim == 2) and ((mask_array.shape[0] != H) or (mask_array.shape[1] != W)): # masks is a list of 2D array (H, W)
                print("resizing...")
                mask_array = cv2.resize(mask_array, (W, H), interpolation=cv2.INTER_NEAREST)
            
            else: print("not resizing")

            p_cam = (R @ original_points.T + t[:, np.newaxis]).T
            # Project points from camera to image coordinates
            p_img = (K_use @ p_cam.T).T # Shape: (N, 3)
            print("p_img:", p_img.shape, "K:", K, "K_use:", K_use, "R:", R, "t:", t)

            valid_depth_indices = p_img[:, 2] > 0
            u = p_img[valid_depth_indices, 0] / p_img[valid_depth_indices, 2]
            v = p_img[valid_depth_indices, 1] / p_img[valid_depth_indices, 2]

            # Check which points fall inside the mask
            # Keep only points that project within the image boundaries

            bounds_indices = (u >= 0) & (u < W) & (v >= 0) & (v < H)

            # Get the integer pixel coordinates for valid points
            u_int = u[bounds_indices].astype(int)
            v_int = v[bounds_indices].astype(int)

            original_valid_indices = np.where(valid_depth_indices)[0][bounds_indices]

            # Look up the mask values at the projected coordinates
            mask_values = mask_array[v_int, u_int]

            # The final indices are where the mask is non-zero
            valid_mask_indices = original_valid_indices[mask_values.astype(bool)]

            if len(valid_mask_indices) == 0:
                print(f"[WARNING] Mask {idx+1} produced no valid points, skipping.")
                continue

            masked_points = points[valid_mask_indices]
            masked_colors = colors[valid_mask_indices]
            if masked_points.size == 0:
                print(f"[WARNING] No points left after indexing for mask {idx+1}, skipping.")
                continue

            masked_pcd = o3d.geometry.PointCloud()
            masked_pcd.points = o3d.utility.Vector3dVector(masked_points)
            masked_pcd.colors = o3d.utility.Vector3dVector(masked_colors)
            if masked_pcd.is_empty():
                print(f"[WARNING] Empty PCD for mask {idx+1}, skipping.")
                continue

            pointcloud_filepath = os.path.join(
                output_pointcloud_dir,
                f"pointcloud_{Path(image_filename).stem}_{idx}.pcd"
            )

            self.save_pointcloud(masked_pcd, pointcloud_filepath)
            point_cloud_filepaths.append(pointcloud_filepath)

        normed_pcd_filepath = os.path.join(
            output_pointcloud_dir,
            f"pointcloud_{Path(image_filename).stem}_whole.pcd"
        )
        self.save_pointcloud(scene_pcd, normed_pcd_filepath)

        return camera_position, camera_lookat, normed_pcd_filepath, point_cloud_filepaths, False, depth_map_np, focal_val # canonicalized = false

Remember that PIL image has W, H = image.size, but it may be the reverse for the order of dimension in numpy. Be very careful when dealing with the mask.
Check for model’s confidence to examine whether it is “guessing” answers on multiple choice questions.
The way to fetch every branch from remote and track them, create local branch if it does not exist:

git branch -r | grep origin/ | grep -v -- '->' | sed 's|origin/||' | xargs -I{} git branch --track {} origin/{}

(Note the – in grep is necessary.)

cherry-pick all non-conflict commits:

#!/bin/bash

# Ensure you are on the correct branch before running
# git checkout my-feature-branch

# Get the list of commits in main that are not in the current branch (HEAD)
# --reverse ensures they are processed in chronological order (oldest first)
commits_to_pick=$(git rev-list --reverse HEAD..main)

if [ -z "$commits_to_pick" ]; then
  echo "Your branch is up to date with main. Nothing to do."
  exit 0
fi

echo "Starting to cherry-pick commits from main..."

# Loop through each commit hash
for commit in $commits_to_pick
do
  echo "--------------------------------------------------------"
  echo "Attempting to cherry-pick commit: $(git log --format='%h %s' -n 1 $commit)"

  # Attempt to cherry-pick the commit.
  # The '-n' or '--no-commit' option is not used so that successful picks are committed automatically.
  if git cherry-pick $commit &> /dev/null; then
    echo "✅ Successfully cherry-picked $commit"
  else
    echo "❌ Conflict detected on commit $commit. Skipping it."
    # Abort the cherry-pick to reset the working directory to a clean state
    git cherry-pick --abort
  fi
done

echo "--------------------------------------------------------"
echo "Process complete. All non-conflicting commits have been applied."

when you are using ray, you need to check whether the worker sees the same file as the driver. This is particularly confusing if you are working on the same node.
Be careful of your GPU utilization ratio (for VLLM) when you are training LLM (e.g. with verl). If you are using smaller GPUs, you should slightly decrease this utilization ratio to make room for pytorch.
Qwen models might meet illegal memory access issue; try to add these when running your experiments:

export CUDA_ENABLE_COREDUMP_ON_EXCEPTION=1
export CUDA_COREDUMP_SHOW_PROGRESS=1
export CUDA_COREDUMP_GENERATION_FLAGS='skip_nonrelocated_elf_images,skip_global_memory,skip_shared_memory,skip_local_memory'
export CUDA_COREDUMP_FILE="/persistent_dir/cuda_coredump_%h.%p.%t"

export VLLM_ATTENTION_BACKEND=XFORMERS # important!

be very careful when using if {data_structure}; check bool() explicitly. For example, dataproto in verl could silently be false when converted to bool, causing “if …” to fail.
When you check logprob between ref model and actor, be very careful with your temperature, which could scale logprob!
How to find indentation error in a big project:

from pathlib import Path
import sys
for p in sorted(Path('.').rglob('*.py')):
    try:
        compile(p.read_text(encoding='utf-8'), str(p), 'exec')
    except IndentationError as e:
        print(f"\nIndentationError in {p}:{e.lineno}:{e.offset or ''} -> {e.msg}")
        # Show the bad line:
        try:
            line = p.read_text(encoding='utf-8').splitlines()[e.lineno-1]
            print(f"  {line}")
        except Exception:
            pass
        sys.exit(1)
print("No IndentationError found.")

If the training hangs, you could consider:

tensor parallel and NCCL. Is it the problem of xformers / tensor parallel size with too small overhead for communication?
Do you have the same number of forward / backward across different ranks for training?

specifically, for verl which hangs at NCCL version, set export NCCL_P2P_DISABLE=1 as suggested by this: https://github.com/volcengine/verl/issues/597

How to check which card has ECC error:

nvidia-smi -q -d ECC | egrep "GPU [0-9]|Single Bit|Double Bit|Uncorrectable" -n
nvidia-smi --query-gpu=index,pci.bus_id --format=csv,noheader