Walter’s Approach

Nothing is f***ed here dude. Nothing is f***ed.

In late 2022, OpenAI released ChatGPT. By now this is old news, but what a great learning tool for those seeking further knowledge.
Having very young boys at the time, I did what many people did and used ChatGPT to help me write several 10-page children’s books. They are centered around the funny conversations and adventures of my own children, so it will be fun to dig those up later.

During this time, the fascination with how all this was even possible sank in. The result: I learned that Python can do pretty much anything, and I can tell you with 98% accuracy that Leo did not survive the Titanic.

Fast-forwarding a couple of months, I have a slightly better handle on the machine learning world and how helpful it can be, provided the right context. I felt confident I could read the whitepaper “Attention Is All You Need” and understand it. Let’s just say it is on the re-read list. There are also many, many other valuable resources, and in the scope of personal learning, this is just scratching the surface.

I was watching a short lecture from an MIT professor discussing AI, machine learning, ChatGPT, etc., and he mentioned that the field of study is still very new and that a lot is unknown, but there is speculation about a relationship to thermodynamics. I thought this was even more interesting. Enter Walter’s Approach.

Please keep in mind, this article is supposed to be fun and if there is anything mathematically wrong or something is not correct, feel free to contact me here.

After the Titanic, I tried learning more about using different models on different datasets: MLPs and CNNs with FashionMNIST, MNIST, KMNIST, CIFAR-10, and CINIC-10.

When training a CNN model, an optimizer is responsible for updating the model’s weights and biases in order to minimize the loss function. The loss function measures the difference between the model’s predictions and the ground truth labels. The optimizer uses the gradient of the loss function with respect to the model’s parameters to update the parameters in a direction that will reduce the loss.
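To make that update rule concrete, here is a minimal sketch of plain gradient descent on a one-parameter model, written in pure Python so it runs without PyTorch. The names loss_fn, grad_fn, and gradient_descent, the toy data, and the learning rate are all made up for illustration. Real optimizers such as Adam add more machinery on top of this basic step.

```python
# Toy example: fit y = w * x to a few points with plain gradient descent.
# All names and numbers here are illustrative assumptions, not the article's code.

def loss_fn(w, xs, ys):
    """Mean squared error between the predictions w * x and the labels y."""
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def grad_fn(w, xs, ys):
    """Derivative of the mean squared error with respect to w."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


def gradient_descent(w, xs, ys, lr=0.05, steps=200):
    """Repeatedly step against the gradient to reduce the loss."""
    for _ in range(steps):
        w -= lr * grad_fn(w, xs, ys)
    return w


xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]  # generated by the "true" weight w = 2
w_fit = gradient_descent(0.0, xs, ys)
print(f"fitted w = {w_fit:.4f}, loss = {loss_fn(w_fit, xs, ys):.6f}")
```

The only line that matters is `w -= lr * grad_fn(...)`: the weight moves a small step in the direction that lowers the loss, and everything else is bookkeeping.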

Walter’s Approach goes on the assumption that nothing is f***ed… until it is.

The PhaseShiftOptimizer (PSO) class is used to find the optimal weights in a neural network model. It also defines a phase shift loss function, and the analogy between that loss function and the enthalpy of vaporization can be understood in the following manner:

Optimal Weight: The optimal_weight parameter represents a threshold value where the loss function changes behavior. This is analogous to the temperature at which a substance transitions from liquid to vapor. The enthalpy of vaporization is the energy required to transform a given quantity of a substance into a gas at that fixed temperature.

Sinusoidal Component: For weight differences less than the optimal_weight, the loss is calculated using a sinusoidal function. The parameters A, f, and ϕ control the amplitude, frequency, and phase of the sine wave, respectively. This part of the loss function might represent the oscillatory behavior of molecules in the liquid phase, where energy levels are confined and molecules are closer together.

Phase Shift Component: For weight differences greater than or equal to the optimal_weight, the loss transitions into a phase shift component governed by the parameter C and the transition_slope. The constant C represents the initial level of the vapor phase, where energy levels are more stable. The transition_slope defines the abruptness of this transition, with a slope close to 0 indicating an abrupt change. This mimics the behavior of a substance undergoing a phase transition, where the energy levels change sharply at the critical point.
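The three pieces above can be sketched as a single scalar function. This is a simplified, pure-Python take on the phase_shift_loss method shown later in the article; the function name phase_shift_loss_scalar and the default parameter values are placeholders I picked for illustration, not tuned or recommended settings.

```python
import math

# Scalar version of the piecewise loss described above.
# The default parameter values are arbitrary placeholders chosen for illustration.

def phase_shift_loss_scalar(weight, optimal_weight=1.0, A=0.5, f=1.0, phi=0.0,
                            C=0.2, transition_slope=0.1):
    if weight < optimal_weight:
        # "Liquid" regime: oscillating sinusoidal component.
        return A * math.sin(2 * math.pi * f * weight + phi)
    # "Vapor" regime: starts at C and changes linearly with transition_slope.
    return C + transition_slope * (weight - optimal_weight)


for w in (0.25, 0.75, 1.0, 2.0):
    print(w, round(phase_shift_loss_scalar(w), 4))
```

With these placeholder values the sine term is close to 0 just below the threshold and the loss jumps to C = 0.2 at the threshold itself, so the two branches do not have to meet smoothly.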

In summary, this loss function can be seen as a mathematical representation of the physical process of vaporization. By tuning the parameters of the loss function, including the transition_slope, the model is effectively learning the underlying characteristics of the phase transition. This allows for a more nuanced understanding of the learning process, analogous to the Enthalpy of vaporization in physical chemistry.

Walter’s Phase Shift Optimizer

from collections import deque

import numpy as np
import torch


class PhaseShiftOptimizer(torch.optim.Optimizer):
    def __init__(self, params, lr, threshold, shift_boundary, min_gradient_magnitude, optimal_weight, A, f, phi, C, phase_shift_loss_weight, transition_slope):

        # Initialize the optimizer with phase shift parameters.
        defaults = dict(lr=lr, threshold=threshold, shift_boundary=shift_boundary, min_gradient_magnitude=min_gradient_magnitude, optimal_weight=optimal_weight, A=A, f=f, phi=phi, C=C, phase_shift_loss_weight=phase_shift_loss_weight, transition_slope=transition_slope)
        print("Initializing \U0001F4E1  PhaseShiftOptimizer \U0001F4E1  with parameters:\n", defaults)

        super(PhaseShiftOptimizer, self).__init__(params, defaults)
        self.gradient_history = deque(maxlen=2)

    def step(self, closure=None):
        # Perform an optimization step, applying phase shift logic if conditions are met.
        #print("Starting optimizer step...")  # Debug
        loss = None
        if closure is not None:
            loss = closure()

        phase_shift_details = []

        for group in self.param_groups:
            # Extract parameters
            lr, threshold, shift_boundary, min_gradient_magnitude, optimal_weight, A, f, phi, C, phase_shift_loss_weight, transition_slope = self.extract_parameters(group)
            #print(f"Extracted parameters: lr={lr}, threshold={threshold}, shift_boundary={shift_boundary}, optimal_weight={optimal_weight}, A={A}, f={f}, phi={phi}, C={C}, transition_slope={transition_slope}")  # Debug
            most_significant_weight = None
            max_gradient_change = 0

            for idx, p in enumerate(group['params']):
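                if p.grad is None:
                    continue
                grad = p.grad.data
                # Plain gradient descent update.
                p.data -= lr * grad
                candidate, max_gradient_change = self.apply_phase_shift_logic(idx, grad, max_gradient_change, min_gradient_magnitude, p, phase_shift_details)
                # Keep the earlier winner when this parameter does not qualify.
                if candidate is not None:
                    most_significant_weight = candidate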

            # Additional logic for phase shift
            if most_significant_weight is not None:
                phase_shift_direction = torch.sign(most_significant_weight.grad)
                #print(f"Applying phase shift: phase_shift_direction={phase_shift_direction}")  # Debug
                self.apply_phase_shift(most_significant_weight, phase_shift_direction, threshold, shift_boundary)

        # Update gradient history (parameters without a gradient are stored as zeros)
        self.gradient_history.append([p.grad.clone() if p.grad is not None else torch.zeros_like(p) for p in self.param_groups[0]['params']])

        return loss, phase_shift_details

    def phase_shift_loss(self, weight, optimal_weight, A, f, phi, C, transition_slope):
        # Compute the phase shift loss, a combination of sinusoidal and linear parts.
        #print(f"Calculating phase shift loss: optimal_weight={optimal_weight}, A={A}, f={f}, phi={phi}, C={C}, transition_slope={transition_slope}")  # Debug
        sinusoidal_part = A * torch.sin(2 * np.pi * f * weight + phi)
        phase_shift_part = C + transition_slope * (weight - optimal_weight)
        loss_values = torch.where(weight < optimal_weight, sinusoidal_part, phase_shift_part)
        return loss_values.mean()

    def extract_parameters(self, group):
        # Extract parameters for the current optimization step.
        #print("Extracting parameters...")  # Debug statement
        lr, threshold, shift_boundary, min_gradient_magnitude, optimal_weight, A, f, phi, C, phase_shift_loss_weight, transition_slope = group['lr'], group['threshold'], group['shift_boundary'], group['min_gradient_magnitude'], group['optimal_weight'], group['A'], group['f'], group['phi'], group['C'], group['phase_shift_loss_weight'], group['transition_slope']
        #print(f"Extracted phase_shift_loss_weight: {phase_shift_loss_weight}")  # Debug statement
        return lr, threshold, shift_boundary, min_gradient_magnitude, optimal_weight, A, f, phi, C, phase_shift_loss_weight, transition_slope

    def apply_phase_shift_logic(self, idx, grad, max_gradient_change, min_gradient_magnitude, p, phase_shift_details):
        # Determine if a phase shift should be applied based on gradient changes.
        #print(f"Applying phase shift logic for param {idx}")  # Debug
        previous_grad = self.gradient_history[-1][idx] if len(self.gradient_history) > 0 else torch.zeros_like(grad)
        gradient_change = torch.abs(grad - previous_grad).max()
        most_significant_weight = None  # Initialize variable

        if gradient_change > max_gradient_change and gradient_change > min_gradient_magnitude:
            max_gradient_change = gradient_change
            most_significant_weight = p
            phase_shift_details.append({
                'gradient_change': max_gradient_change.item(),
                'phase_shift_value': gradient_change.cpu().numpy().tolist()
            })
        return most_significant_weight, max_gradient_change

    def apply_phase_shift(self, most_significant_weight, phase_shift_direction, threshold, shift_boundary):
        # Apply the phase shift to the most significant weight.
        #print(f"Applying phase shift: most_significant_weight={most_significant_weight}, phase_shift_direction={phase_shift_direction}, threshold={threshold}, shift_boundary={shift_boundary}")  # Debug
        with torch.no_grad():
            phase_shift_value = phase_shift_direction * threshold
            padded_shift_value = torch.clamp(phase_shift_value, -shift_boundary, shift_boundary)
            most_significant_weight += padded_shift_value
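
To see what one step does without PyTorch in the way, here is a pure-Python walk-through on two scalar weights. It follows the flow that step() appears to be aiming for: a normal gradient update, then an extra clamped nudge on the weight whose gradient changed the most. The helper names (clamp, sign, toy_step) and all the numbers are hypothetical, so treat it as a sketch rather than the actual implementation.

```python
# Pure-Python walk-through of one optimizer step on scalar weights.
# Hypothetical helpers and values; a sketch of the flow, not the author's code.

def clamp(value, low, high):
    return max(low, min(high, value))


def sign(value):
    return (value > 0) - (value < 0)


def toy_step(weights, grads, prev_grads, lr, threshold, shift_boundary,
             min_gradient_magnitude):
    # 1. Ordinary gradient descent update for every weight.
    new_weights = [w - lr * g for w, g in zip(weights, grads)]

    # 2. Find the weight whose gradient changed the most since the last step.
    best_idx, max_change = None, 0.0
    for idx, (g, pg) in enumerate(zip(grads, prev_grads)):
        change = abs(g - pg)
        if change > max_change and change > min_gradient_magnitude:
            best_idx, max_change = idx, change

    # 3. Nudge that weight by a clamped, fixed-size shift along sign(grad).
    if best_idx is not None:
        shift = clamp(sign(grads[best_idx]) * threshold,
                      -shift_boundary, shift_boundary)
        new_weights[best_idx] += shift
    return new_weights, best_idx


weights, idx = toy_step(
    weights=[1.0, -0.5], grads=[0.2, -0.9], prev_grads=[0.1, 0.3],
    lr=0.1, threshold=0.05, shift_boundary=0.02, min_gradient_magnitude=0.5,
)
print(weights, idx)
```

Here only the second weight qualifies, because its gradient swung by 1.2 while the first one moved by just 0.1, and the shift of 0.05 is clamped down to the boundary of 0.02.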

When running an Optuna training session of 35 trials and 10 epochs, using the same hyperparameters with Adam and Walter’s Phase Shift Optimizer, I get the following results. I think this was using the CINIC-10 dataset, but it might have been CIFAR-10. Training took about 20 hours on a GeForce RTX 4060 Ti and used things like cross-validation and early stopping.

View / Download the Files on GitHub