Entropy
Entropy is one of the most universal ideas in science, shaping everything from energy flow to information.
- In mathematics, it’s a formula that measures uncertainty.
- In physics, it’s the law that governs how heat and energy spread.
- In chemistry, it’s the tally of molecular disorder.
- In information theory, it’s the currency of surprise in messages, data compression and transfer over noisy channels.
- In machine learning, it’s the score of impurity in decision-making.
But are all of these really talking about the same thing?
Are they all just using one tool, entropy, dressed in different clothes?
Let’s unpack each of them.
Physics
The most common way to describe entropy is as disorder: more random, more mixed, less ordered.
But at a more fundamental level, let's look at it as a question: how is something distributed in a system?
Here are two important words: system and distributed.
System - In science, a system is simply the part of the universe you choose to focus on or study.
Everything outside that part is called the surroundings or environment.
For example, you might want to study what is happening inside a bag, a kettle, or a kitchen.
Types of systems based on interaction with surroundings:
Closed system
- Exchanges energy (like heat or work) but not matter with its surroundings.
- Example: A sealed container whose walls let heat pass through, but no gas or liquid escapes or enters.
Open system
- Exchanges both energy and matter with its surroundings.
- Example: A boiling pot of water without a lid: steam (matter) escapes, and heat (energy) flows in and out.
Isolated system
- Exchanges neither energy nor matter with its surroundings.
- Example: An ideal thermos bottle, perfectly insulated so nothing gets in or out (theoretical, since perfect isolation is impossible).
Distributed - Defining a distribution can be tricky: first we have to define exactly what thing we want to measure the distribution of, and only then can we ask whether that thing is distributed uniformly, skewed, or following some other pattern.
For example - if I treat a living room as an isolated system containing 100 pieces of clothing, I might ask how they’re distributed: neatly folded in one corner or scattered randomly across the room.
- Suppose a person enters the living room blindfolded, playing a game where they try to pick up a piece of clothing every 5 seconds, repeating this 100 times.
- If the clothes are scattered randomly (high entropy), they’re more likely to pick up a piece because the clothes are spread out.
- If the clothes are neatly folded in one corner (low entropy), they’re less likely to pick one up unless they happen to walk to that corner.
Entropy describes how energy, information, heat, clothes, or anything else is distributed within a system. It doesn’t measure the quantity or intensity of the thing, only the likelihood of its arrangement.
Now let's look at the example of hot and cold metal bars to solidify the idea of energy distribution, aka entropy.
Take an iron rod at 300°C. What are the factors that can affect how its energy is distributed?
- Temperature: It defines how particles move. Higher temperature leads to higher kinetic energy: atoms vibrate more vigorously. This increases the number of accessible microstates, affecting the distribution and raising entropy.
- Pressure or Volume: These can affect the density of states. For solids like iron, volume changes are small, but in gases, they significantly impact how particles spread out.
- Number of Particles: More particles mean more interactions and jiggling, leading to higher entropy (entropy is extensive).
- Microstates (W): From a statistical view, entropy depends on the number of possible positions, momenta, and quantum states of particles (S = k ln W).
- State of Matter (Solid/Liquid/Gas): Solids have low entropy due to ordered structures; liquids have more disorder; gases have the highest entropy from free particle motion.
I have listed only a few factors here; there can be more. The more we know about a particular system, the more accurately we can predict or calculate its entropy within the framework we are working in.
As you can see these parameters don’t act independently; they’re deeply interrelated. Measuring the effect of each on entropy in isolation is complicated because changing one often changes others. Here’s a more precise view:
Why they are interdependent:
- Temperature T affects how particles move, which influences heat capacity Cp because heat capacity often varies with temperature.
- Changing volume V changes the pressure P (for gases) and the available phase space, which in turn affects energy levels and particle behavior.
- Number of particles N changes density, which impacts how particles interact, affecting heat capacity and accessible microstates W.
- The microstates W depend on all of these combined: temperature, volume, particle number, and quantum states collectively define how many microstates are available.
How to measure the effect of each parameter, considering interdependence:
- Control variables carefully in experiments or simulations
- Change one parameter while keeping others fixed as much as possible (e.g., vary temperature at constant volume and particle number).
- Measure entropy change experimentally or compute it via statistical methods.
Now suppose we want to calculate the entropy of the iron rod at 300°C. We first have to define the system, like this.
Step 1: Define the system and assumptions
- The rod is uniform and homogeneous (same material properties throughout).
- Heat transfer is slow enough to assume quasi-static (reversible) heating.
- The rod's mass m and specific heat capacity Cp(T) are known or measurable.
- Ignore volume changes if thermal expansion is negligible (common for solids).
Step 2: Define a reference
- Entropy is always measured relative to a reference state (often at a baseline temperature or zero entropy state).
- You don’t just “pick a random state and say the entropy is X” without context.
- You measure or calculate the change in entropy ΔS=S_final−S_initial when the system moves from one state to another (e.g., energy x at time t1 to energy y at time t2).
- Absolute entropy values can be tabulated (like standard molar entropy), but these are always relative to a defined zero-point.
Step 3: Use the physics formulae to calculate the change
- If the system was losing energy after time t2, we would say the entropy of the rod has decreased. Why? Because its kinetic energy has reduced, leading to less jiggling, less movement, and fewer accessible microstates. But the surroundings gain that energy, increasing their entropy by a greater amount (because the surroundings are usually at a lower temperature and can spread the heat more effectively).
- If the system was gaining energy from an outside source after time t2, we would say its entropy has increased, because it has gained that energy from the surroundings.
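To make Step 3 concrete, here is a minimal Python sketch. It assumes a roughly constant specific heat for iron of about 450 J/(kg·K) and uses hypothetical numbers for the mass and temperatures; it is an illustration of the quasi-static bookkeeping, not a precise calculation.

```python
import math

def delta_S_quasistatic(mass_kg, cp_J_per_kgK, T_start_K, T_end_K):
    """Entropy change for slow (quasi-static) heating or cooling of a solid,
    assuming a roughly constant specific heat: dS = m * Cp * dT / T."""
    return mass_kg * cp_J_per_kgK * math.log(T_end_K / T_start_K)

# Hypothetical numbers: a 1 kg iron rod cooling from 300 C to a 25 C room.
m, cp = 1.0, 450.0                      # kg, J/(kg*K); Cp of iron assumed ~constant
T_hot, T_room = 300 + 273.15, 25 + 273.15

dS_rod = delta_S_quasistatic(m, cp, T_hot, T_room)   # negative: the rod's entropy drops
Q_released = m * cp * (T_hot - T_room)               # heat dumped into the surroundings
dS_surroundings = Q_released / T_room                # surroundings stay near room temperature

print(dS_rod)                     # ~ -294 J/K
print(dS_surroundings)            # ~ +415 J/K
print(dS_rod + dS_surroundings)   # ~ +121 J/K: total entropy still goes up
```

The rod's entropy drops as it cools, but the surroundings, sitting near room temperature, gain more entropy than the rod loses, so the total still increases.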
After going through the example, it should be clear that three things define entropy completely:
- The distribution of what, exactly, we are measuring (energy, heat, information, clothes).
- The variables affecting that distribution (temperature, pressure, state).
- How we measure the distribution.
- For example, in a 100 sq. meter room with 100 clothes thrown at random, we can measure the distribution by counting how many clothes land in each 1 sq. meter box (see the sketch below).
- For the heated rod, count the number of microstates an atom can take (e.g., unique vibrational or electronic states).
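Here is a toy Python sketch of exactly that counting procedure, assuming a hypothetical 10 m × 10 m room split into 1 m² cells. The entropy formula it uses is the one developed in the information-theory part of this post; here it just serves as a single number for "how spread out" the counts are.

```python
import random
from collections import Counter
from math import log2

def spread_entropy(positions, cell_size=1.0):
    """Count clothes per 1 m x 1 m cell and return the Shannon entropy (bits)
    of the occupancy distribution, i.e. a number for 'how spread out' they are."""
    counts = Counter((int(x // cell_size), int(y // cell_size)) for x, y in positions)
    total = sum(counts.values())
    return -sum((c / total) * log2(c / total) for c in counts.values())

random.seed(0)
scattered = [(random.uniform(0, 10), random.uniform(0, 10)) for _ in range(100)]  # whole room
folded    = [(random.uniform(0, 1),  random.uniform(0, 1))  for _ in range(100)]  # one corner

print(spread_entropy(scattered))  # high: clothes occupy many cells
print(spread_entropy(folded))     # ~0: everything lands in a single cell
```

Scattered clothes occupy many cells fairly evenly and score high; clothes piled into one corner collapse into a single cell and score essentially zero.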
The above defines the whole of Thermodynamic entropy (physical entropy).
This depends on the actual physical state of the system: energy, temperature, molecular configurations, etc.
More Physics Way
If you have noticed, until now we were measuring the distribution using a counting method. This is actually the Boltzmann entropy (simple microstate counting).
Boltzmann Entropy (Simple Microstates Counting)
- Applies to a system in a single macrostate, where all microstates are equally likely.
- Entropy is based on counting how many microstates Ω correspond to that macrostate: \( S = k_B \ln \Omega \) (the same S = k ln W from earlier).
- Example: Imagine you have 3 coins lying on a table, and you only care about how many coins show heads.
- Macrostate: Exactly 2 coins are heads.
- Microstates: The specific arrangements that have 2 heads and 1 tail. There are 3 such microstates (HTH, HHT, THH).
- Since all these microstates are equally probable, Boltzmann entropy counts these 3 microstates.
Gibbs Entropy (Probability Weighted)
- Applies when the system is in a statistical mixture of microstates, each with a probability \( p_i \).
- Entropy accounts for the uncertainty over which microstate the system is actually in: \( S = -k_B \sum_i p_i \ln p_i \).
- Example: Same 3 coins, but now you know the probability of each microstate is different:
- Suppose probability that the coins are in microstate 1 (HTH) is 0.5,
- Microstate 2 (HHT) is 0.3,
- Microstate 3 (THH) is 0.2.
- Now entropy measures the uncertainty weighted by these probabilities.
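Here is a tiny Python sketch contrasting the two countings on the same 3-coin example, with entropies reported in units of \( k_B \) (i.e., just the logarithms).

```python
from math import comb, log

# Boltzmann: macrostate "2 heads out of 3 coins"; all microstates equally likely.
omega = comb(3, 2)          # 3 microstates: HHT, HTH, THH
S_boltzmann = log(omega)    # S = ln(omega), in units of k_B

# Gibbs: same 3 microstates, but with unequal probabilities.
p = [0.5, 0.3, 0.2]
S_gibbs = -sum(pi * log(pi) for pi in p)   # S = -sum p_i ln p_i, in units of k_B

print(S_boltzmann)   # ln 3 ~ 1.099
print(S_gibbs)       # ~ 1.030, slightly less: uneven probabilities mean less uncertainty
```

The Gibbs value is slightly smaller than ln 3 because the probabilities are uneven: we are a bit less uncertain than in the equally likely case.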
There is one more way of looking at entropy, from Clausius: entropy as the unavailability of energy to do useful work, a practical way to understand what entropy means beyond just counting microstates.
What does this mean?
Energy can exist in many forms: heat, mechanical work, chemical energy, etc.
Not all energy in a system can be converted into useful work. Some energy is “lost” or dispersed in ways that can’t be harnessed to do things like move a piston, run a motor, or power a machine.
Entropy quantifies that "lost" or unavailable portion of energy: the part that is spread out or randomized so much that you cannot concentrate it or convert it back into work.
Intuition and Example:
High-quality energy (like work, or heat at high temperature) can largely be converted into useful work; e.g., a heated iron rod can be used to heat up water. Entropy is lower.
Low-quality energy (like heat evenly spread through the environment at room temperature) can't be converted to work; a room-temperature rod can't be used to heat up water, hence its entropy is higher.
Entropy measures how much energy has “degraded” from high-quality to low-quality, hence reducing the capacity to do work.
When energy disperses, it spreads over many particles and states increasing the number of accessible microstates.
The system moves from an ordered, low-entropy state (few microstates, energy concentrated, usable) to a disordered, high-entropy state (many microstates, energy spread out, less usable).
So “lost energy” is the energy tied up in many microscopic configurations that can’t be coordinated to do macroscopic work.
An interesting example is the sun: it connects entropy across different domains, thermodynamics and information, step by step.
Step 1- The Sun as a Hot, Complex System
The sun is incredibly hot and full of activity. Inside it, trillions of tiny particles move in countless ways. Because of all this complexity, scientists say the sun has high entropy. This means there are many possible ways the sun's particles can be arranged inside it: a lot of microscopic disorder.
Step 2- Looking at the Sun and Space Together
Now, if we consider not just the sun but also the space around it, including the light and heat it sends out, we get a bigger system. The sun transfers energy to space all the time, and this outflow adds to the total disorder, or entropy, of the larger system. Hence the sun's own entropy is comparatively lower, but the overall system's entropy is high.
Step 3- What About Predicting the Sun’s State?
Even though the sun is complex inside, from far away, it looks very stable. For example, we know it will be hot tomorrow, just like today. Because of this, our uncertainty about the sun’s overall state is very low. We don’t need much new information to describe what the sun will be like.
Step 4- Two Different Kinds of Entropy
Here’s the key: The word “entropy” is used in two related but different ways.
- One kind of entropy measures how complex or disordered a system is inside like the sun’s many tiny particle arrangements.
- The other kind measures how uncertain or surprised we are about what the system will do like how much information we need to describe the sun’s state tomorrow.
Step 5- The Role of Granularity: How Closely You Look Matters
The “overall state” of the sun depends on how closely or in detail you choose to look.
If you only care about big things, like whether the sun is hot or not, then the sun's state seems very simple and predictable, and you need little information to describe it.
But if you zoom in to tiny details, the exact position and energy of every particle inside the sun, the state becomes incredibly complex and uncertain, and you'd need a huge amount of information to describe it fully.
So, the amount of entropy or uncertainty you assign depends on the level of detail you’re considering.
An object with low entropy has less uncertainty about its state but this depends on how detailed or ‘fine-grained’ the description of the state is.
What happens if we get to know more? The uncertainty decreases, and hence the information entropy decreases. When information is hidden, uncertainty grows, and information entropy increases.
Information Entropy
A key measure in information theory is entropy. Entropy quantifies the amount of uncertainty involved in the value of a random variable or the outcome of a random process.
For example, consider rain in two locations: the Amazon rainforest and a tropical city. In the Amazon, it rains almost every day, so the outcome is fairly predictable: low uncertainty, low entropy. In the tropical city, rain is less predictable, so the uncertainty is higher: higher entropy.
A very important question can arise: why are we measuring uncertainty and not certainty itself? Because in any process, if something is guaranteed, it conveys no new information.
The next question is how we calculate uncertainty. In Shannon's paper, he defines the properties entropy should have as follows:
Suppose we have a set of possible events whose probabilities of occurrence are \( p_1, p_2, \ldots, p_n \).
These probabilities are known but that is all we know concerning which event will occur.
Taking a weather prediction example, \( p_1 \) could be the probability of rain, \( p_2 \) the probability of strong wind, and \( p_3 \) the probability of high humidity.
Can we measure how much "choice" is involved in the selection of the event, or how uncertain we are of the outcome?
If there is such a measure, say \( H(p_1, p_2, \ldots, p_n) \), it is reasonable to require of it the following properties:
Axiom 1.
- H should be continuous in the p_i.
What it means:
Continuity here means that small changes in the probabilities of events should lead to small changes in entropy.
Entropy shouldn’t “jump” suddenly if the probabilities change just a little.
For a fair coin, the probability of heads and tails is the same, and the entropy H is at its maximum.
Now, slightly bias it: \( p_h = 0.51 \) and \( p_t = 0.49 \).
The uncertainty has decreased a tiny bit, because the coin is slightly more predictable now.
Entropy changes slightly it doesn’t suddenly drop to zero or skyrocket.
If H were not continuous, then a tiny change in probability could make the entropy jump wildly, which wouldn’t make sense small changes in our knowledge shouldn’t cause a huge change in measured uncertainty.
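Jumping ahead to the formula derived below, the numbers bear this out:
\[ H(0.5, 0.5) = 1 \text{ bit}, \qquad H(0.51, 0.49) = -(0.51\log_2 0.51 + 0.49\log_2 0.49) \approx 0.9997 \text{ bits} \]
A tiny nudge in the probabilities changes the entropy only a tiny amount, exactly as continuity requires.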
Axiom 2.
If all the \( p_i \) are equal, \( p_i = 1/n \), then H should be a monotonic increasing function of n.
With equally likely events there is more choice, or uncertainty, when there are more possible events.
What it means:
A fair coin and a fair die both assign equal probability to each of their outcomes, but rolling a die has more possible outcomes, which means more uncertainty. Even though each outcome is equally likely in both cases, rolling a die is less predictable than flipping a coin: more possible choices, more uncertainty, higher entropy.
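With n equally likely outcomes, the formula derived below reduces to \( H = \log n \), which indeed grows with n:
\[ H_{\text{coin}} = \log_2 2 = 1 \text{ bit}, \qquad H_{\text{die}} = \log_2 6 \approx 2.585 \text{ bits} \]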
Axiom 3.
If a choice be broken down into two successive choices, the original H should be the weighted sum of the individual values of H.
What it means:
If we calculate the probability of an event by breaking it into multiple choices, as we do in a law-of-total-probability calculation, then in the same way the entropy should also add up.
Hence we add up the entropy at each step, weighted by how often that step is reached, from the start to the end node.
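The worked example from Shannon's paper makes this concrete: a choice among three outcomes with probabilities 1/2, 1/3, 1/6 can be made as a fair choice between two branches, followed (half the time) by a 2/3 vs 1/3 choice inside the second branch:
\[ H\!\left(\tfrac{1}{2}, \tfrac{1}{3}, \tfrac{1}{6}\right) = H\!\left(\tfrac{1}{2}, \tfrac{1}{2}\right) + \tfrac{1}{2}\, H\!\left(\tfrac{2}{3}, \tfrac{1}{3}\right) \approx 1 + \tfrac{1}{2}(0.918) \approx 1.459 \text{ bits} \]
The factor 1/2 is the weight: the second choice only happens half the time.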
The only kind of function that can satisfy all of these properties is built on the logarithm; for a single outcome, \[ H(i) = - \log(p_i) \]
Hence entropy is expressed through log probabilities. So far this is the uncertainty of a single event, but a random process can have multiple outcomes, so instead of the uncertainty of one event let's think about the average uncertainty.
Note: Shannon also showed that this formula satisfies a few more properties, and how a limiting argument completes the proof, but we are skipping that here.
\[ H = - \sum_{i=1}^{n} p_i \log(p_i), \quad \text{where } p_i \text{ is the probability of the $i$-th event} \]
The reason the log is base 2 is that we are dealing with binary signals, either 1 or 0, so entropy comes out in bits.
Hence \( -\log(p_i) \) is the surprise of seeing the outcome of the i-th event.
If this is a random variable X with probability mass function P, where P(X) is the pmf evaluated at X, then the entropy can be written as
\[ H(X) = E [- \log(P(X))] \]
E is the expectation, i.e., the probability-weighted sum over all outcomes. One thing to note: this is the entropy of a single random variable taken on its own; what happens with dependent variables comes next.
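As a sanity check on the formula, here is a minimal Python sketch of \( H(X) = E[-\log P(X)] \); the rain probabilities are made-up numbers for the earlier Amazon vs. tropical-city example.

```python
import math

def H(pmf, base=2):
    """Shannon entropy H(X) = E[-log P(X)] for a discrete pmf given as a list of probabilities."""
    return -sum(p * math.log(p, base) for p in pmf if p > 0)

# Made-up probabilities for the earlier rain example:
print(H([0.95, 0.05]))   # Amazon: rain is almost certain -> ~0.29 bits (low uncertainty)
print(H([0.5, 0.5]))     # tropical city: rain is a coin flip -> 1.0 bit
print(H([1/6] * 6))      # fair die -> ~2.585 bits (more outcomes, more uncertainty)
```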
Joint and Conditional Entropy
(a) Independent Random Variables
If \(X\) and \(Y\) are independent:
\[ P(X=x, Y=y) = P(X=x) \cdot P(Y=y) \]
The joint entropy is:
\[\begin{aligned} H(X,Y) &= - \sum_x \sum_y P(X=x, Y=y) \log P(X=x, Y=y) \\ &= - \sum_x \sum_y P(X=x) P(Y=y) \log \big(P(X=x) P(Y=y)\big) \\ &= - \sum_x \sum_y P(X=x) P(Y=y) \big(\log P(X=x) + \log P(Y=y)\big) \\ &= - \sum_x P(X=x) \log P(X=x) - \sum_y P(Y=y) \log P(Y=y) \\ &= H(X) + H(Y) \end{aligned}\]
Intuition
Independent: Knowing X tells nothing about Y → total uncertainty is sum.
If X and Y are independent, then:
\[ H(Y∣X)=H(Y) \]
because knowing X tells you nothing about Y.
(b) Dependent Random Variables
If \(X\) and \(Y\) are dependent:
\[ P(X=x, Y=y) \neq P(X=x) \cdot P(Y=y) \]
The joint entropy is:
\[ H(X,Y) = - \sum_x \sum_y P(X=x, Y=y) \log P(X=x, Y=y) \]
But it cannot be split as \(H(X)+H(Y)\). Instead, we use conditional entropy:
\[ H(X,Y) = H(X) + H(Y|X) \]
Where conditional entropy is:
\[ H(Y|X) = - \sum_x \sum_y P(X=x, Y=y) \log P(Y=y | X=x) \]
- If \(X\) and \(Y\) are independent: \(H(Y|X) = H(Y)\) → reduces to the additive case.
Intuition
Dependent: Knowing X reduces uncertainty about Y → total uncertainty is less than the sum.
But if X and Y are dependent, then knowing X reduces the uncertainty about Y. In that case:
\[ H(Y∣X) \leq H(Y) \]
with strict inequality unless they're independent.
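A small Python sketch, using a hypothetical joint table for two dependent binary variables, shows both the chain rule and the fact that conditioning never increases entropy.

```python
import math
from collections import defaultdict

def H(probs):
    """Shannon entropy (bits) of a collection of probabilities."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Hypothetical joint distribution p(x, y) for two dependent binary variables.
joint = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}

px, py = defaultdict(float), defaultdict(float)
for (x, y), p in joint.items():        # marginals
    px[x] += p
    py[y] += p

H_X, H_Y, H_XY = H(px.values()), H(py.values()), H(joint.values())
H_Y_given_X = H_XY - H_X               # chain rule: H(X,Y) = H(X) + H(Y|X)

print(H_XY, H_X + H_Y)     # ~1.722 < 2.0: dependence makes H(X,Y) < H(X) + H(Y)
print(H_Y_given_X, H_Y)    # ~0.722 < 1.0: knowing X reduces the uncertainty about Y
```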
Information Gain
So far we have learned the entropy of a random variable X, written H(X). If two dependent random variables X and Y exist, then after knowing X the uncertainty left in Y is H(Y|X), called the conditional entropy.
If we want the total uncertainty of the two dependent random variables together, that is H(X,Y), called the joint entropy.
Then another natural question arises: if I know the entropy of X as a whole, H(X), and I then get to know Y, by what amount is the entropy of X reduced?
i.e. \[ H(X) - H(X|Y) = \, ? \]
This question mark is what we call information gain, and the name is quite natural:
if I get to know Y, by what amount did I gain information about X?
In the same way, there is some information gain when we reverse the situation:
\[ I(Y;X) = H(Y) - H(Y|X) \]
Now, another way of looking at this: we know the joint entropy is symmetric, i.e. it can be expanded from either side:
\[\begin{aligned} H(X,Y)=H(X)+H(Y∣X)\\ H(X,Y)=H(Y)+H(X∣Y) \\ \text{hence by re-arranging you will get} \\ H(X)+H(Y∣X) = H(Y)+H(X∣Y) \\ H(X)- H(X∣Y) = H(Y) - H(Y∣X) \\ I(X;Y) = H(X)- H(X∣Y) = H(Y) - H(Y∣X) \\ \end{aligned}\]
If this makes sense, you will see that the joint entropy is about the total information in both variables, while information gain is about the information gained for one variable from the other; and because it comes out the same from either side, we call it mutual information.
KL Divergence
Information gain can also be defined by unpacking it as \[ I(X;Y)=\sum_{x,y}p(x,y)\log \frac{p(x,y)}{p(x)\,p(y)} \]
Now look at the ratio inside the log:
- Numerator: the true joint probability p(x,y).
- Denominator: the independent assumption p(x)p(y).
This says:
Mutual information measures how far reality (the true joint) is from the case where X and Y are independent.
That’s exactly a KL divergence:
\[ I(X;Y)=D_{KL}(p(x,y)∣∣p(x)p(y)) \]
Intuition
KL divergence is just the “price of assuming the wrong distribution.”
- If you thought X and Y were independent, you’d model their joint as p(x)p(y).
- The true joint is p(x,y).
- KL divergence tells you how many extra bits you waste, on average, by using the wrong assumption.
So KL comes in very organically: it’s the distance between your mental model (independence) and reality (the true joint).
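Reusing the same hypothetical joint table from the earlier sketch, a few lines of Python confirm that this KL divergence is numerically the same mutual information we computed via entropies.

```python
import math

# Same hypothetical joint table as before, which happens to have uniform marginals.
joint = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}
px = {0: 0.5, 1: 0.5}
py = {0: 0.5, 1: 0.5}

# I(X;Y) as the KL divergence between the true joint and the independence model p(x)p(y).
I = sum(p * math.log2(p / (px[x] * py[y])) for (x, y), p in joint.items())
print(I)   # ~0.278 bits, the same as H(Y) - H(Y|X) ~ 1.0 - 0.722 from the earlier sketch
```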
More standard way of defining KL Divergence is
\[ D_{KL}(P∣∣Q)=\sum_{z}P(z)\log\frac{P(z)}{Q(z)} \]
Step 1: Why the ratio \(\frac{P}{Q}\)?
Imagine you expected the world to follow distribution \(Q\).
But in reality, the world follows \(P\).
At each outcome \(z\), the ratio
\[ \frac{P(z)}{Q(z)} \]
tells you how “surprised” you are when comparing reality \(P\) vs. your assumption \(Q\).
- If \(P(z) > Q(z)\): outcome happens more often than you thought → ratio > 1.
- If \(P(z) < Q(z)\): outcome happens less often than you thought → ratio < 1.
So the log term is literally a correction factor in “bits” or “nats”.
Step 2: Why weight by \(P(z)\)?
If you only look at the ratio, you’re ignoring how often those outcomes really occur.
- For example, suppose you badly mis-estimate a rare outcome (\(P(z)\) tiny); then the ratio might be huge, but it barely matters because that outcome happens rarely.
- If you consistently mis-estimate a common outcome (\(P(z)\) large), then the error really matters.
Thus, you average the correction using the true distribution \(P\). That’s why it’s an expectation:
\[ D_{KL}(P||Q) = \mathbb{E}_{z\sim P}\Big[\log \frac{P(z)}{Q(z)}\Big] \]
Step 3: Intuitive Example
Say we flip a coin:
- Reality: \(P = (0.9 \text{ heads}, 0.1 \text{ tails})\).
- Your belief: \(Q = (0.5, 0.5)\).
Now compute:
\[ D_{KL}(P||Q) = 0.9 \log \frac{0.9}{0.5} + 0.1 \log \frac{0.1}{0.5} \]
For heads: ratio = \(0.9/0.5 = 1.8\). You consistently under-estimated how often heads appear, so log term is positive → penalty is big, and since heads are frequent, the weight (0.9) amplifies it.
For tails: ratio = \(0.1/0.5 = 0.2\). You thought tails were more common than reality. Log term is negative, but it only happens 10% of the time, so its effect is small.
Total KL > 0 means your model \(Q\) is “diverging” from truth.
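Carrying the arithmetic through in nats (natural log), the numbers work out to:
\[ D_{KL}(P\,||\,Q) = 0.9 \ln\!\frac{0.9}{0.5} + 0.1 \ln\!\frac{0.1}{0.5} \approx 0.529 - 0.161 \approx 0.368 \ \text{nats} \]
This is the same 0.368 that reappears in the example list further down.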
Step 4: Geometric view
Think of KL as how tilted one distribution is compared to another.
\(P\) is the “true shape.”
\(Q\) is your “assumed shape.”
The ratio \(\frac{P}{Q}\) shows the local stretching or shrinking.
The weighting by \(P\) means: only care about distortion where reality actually lives.
Ratio: local mismatch.
Log: expressed in information units.
Weighted by \(P\): average mismatch under reality.
KL divergence has a very particular range:
\[ D_{KL}(P||Q) \;\; \in \; [0, \infty] \]
Why lower bound is 0
- KL divergence is always non-negative (Gibbs’ inequality).
- \(D_{KL}(P||Q) = 0\) iff \(P = Q\) exactly (they’re identical distributions everywhere).
Why no finite upper bound
- If there’s any point \(z\) where \(P(z) > 0\) but \(Q(z) = 0\), then the ratio \(\frac{P(z)}{Q(z)}\) blows up → KL divergence goes to \(\infty\).
- Even if both are >0 everywhere, if \(Q\) assigns tiny probabilities where \(P\) assigns big ones, KL can become arbitrarily large.
Example (discrete coin):
\(P=(0.5, 0.5)\), \(Q=(0.5, 0.5)\):
\[ D_{KL}(P||Q)=0 \]
\(P=(0.9, 0.1)\), \(Q=(0.5, 0.5)\):
\[ D_{KL}(P||Q)\approx 0.368 \]
\(P=(0.9, 0.1)\), \(Q=(1.0, 0.0)\):
- For tails, \(P=0.1\), \(Q=0\). Ratio → ∞.
- So KL → ∞.
- KL is asymmetric: \(D_{KL}(P||Q) \neq D_{KL}(Q||P)\).
- Both are ≥ 0, but can blow up in different places.
- Minimum = 0 (when the two distributions are identical).
- Maximum = ∞ (when \(Q\) misses events that \(P\) considers possible).
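Here is a short Python sketch of the same three coin examples (values in nats), including how the infinite case and the asymmetry show up.

```python
import math

def kl(p, q):
    """D_KL(P || Q) in nats for two discrete distributions given as lists of probabilities."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi == 0:
            continue                 # 0 * log(0/q) contributes nothing
        if qi == 0:
            return math.inf          # P puts mass where Q says "impossible"
        total += pi * math.log(pi / qi)
    return total

print(kl([0.5, 0.5], [0.5, 0.5]))   # 0.0    : identical distributions
print(kl([0.9, 0.1], [0.5, 0.5]))   # ~0.368 : the coin example above
print(kl([0.9, 0.1], [1.0, 0.0]))   # inf    : Q misses an event that P considers possible
print(kl([0.9, 0.1], [0.5, 0.5]) == kl([0.5, 0.5], [0.9, 0.1]))   # False: asymmetric
```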
Cross Entropy
Let's take a dataset D with input features X and actual labels A, and a model M with parameters θ trying to predict the labels, where Y denotes the predicted labels.
\[ Y = M_{\theta}(X) ≈ p(A∣X;θ) \]
Hence A is the actual label distribution and Y is the predicted label distribution.
A couple of things to notice:
1. The entropy of the truth A is H(A).
2. The conditional entropy H(actual labels A | features X) will be > 0 in a real-world setting, but in a deterministic problem where knowing X predicts A exactly (like a square function), this value is 0.
3. H(A|X) is the Bayes irreducible uncertainty, i.e., the entropy or unpredictability left in A even when we fully know X, coming from randomness or leftover ambiguity.
The goal of learning is to get H(Y) ≈ H(A), so that the model has captured all the predictable structure in A: it is uncertain only where A itself is uncertain.
The information gain between X and A is a fixed quantity. Since Y is only an approximation computed from X, IG(A;X) ≥ IG(A;Y); for a perfectly trained model the two are equal.
By Data Processing Inequality (since A→X→Y) \[ I(A;Y)≤I(A;X). \]
\[\begin{aligned} IG(A;X) = H(A) - H(A|X) \\ H(A) = IG(A;X) + H(A|X) \\ \end{aligned}\]
Entropy of A = things that will be modeled by the model M + things that our model can't learn.
The true best possible model is the Bayes classifier: \[ P^*(A∣X) \]
Our model gives some approximation: \[ Q(A∣X) \]
The KL/excess risk measures how far your predicted probabilities are from the Bayes-optimal probabilities, in expectation \[ E_{X}[D_{KL}(P^*(A∣X) ∣∣ Q(A∣X))]. \]
This KL term = estimation/approximation error the smaller the error the better the learning.
Hence it should be clear that:
Entropy of A = entropy of A with respect to Q - irreducible error - KL term.
Ignoring the irreducible error and rearranging, we get:
Entropy of A with respect to Q = entropy of A + KL term.
this can be formally written as
\[ H(P^*, Q) = H(P^*) + D_{KL}(P^* ∣∣ Q) \] To prove it we can write
\[\begin{aligned} D_{KL}(P^*(A∣X) ∣∣ Q(A∣X)) = \sum_{A} P^*(A∣X) \big(\log P^*(A∣X) - \log Q(A∣X)\big) \\ D_{KL}(P^*(A∣X) ∣∣ Q(A∣X)) = \sum_{A} P^*(A∣X) \log P^*(A∣X) - \sum_{A} P^*(A∣X) \log Q(A∣X) \\ D_{KL}(P^*(A∣X) ∣∣ Q(A∣X)) = -H(P^*) + H(P^*,Q) \\ H(P^*,Q) = H(P^*) + D_{KL}(P^*(A∣X) ∣∣ Q(A∣X)) \\ \end{aligned}\]
Here H(P^*, Q) is the cross-entropy; despite the notation, it is not the joint entropy.
Hence:
- Entropy: the truth.
- Cross-entropy: you trying to predict the truth.
- KL term: the error, i.e., how far you are from the truth.
\[\begin{aligned}
\text{Missing Information} = IG(A;X) - IG(A;Y) \\
E_{X}[D_{KL}(P^*(A∣X) ∣∣ Q(A∣X))] = \text{Missing Information} \\
\end{aligned}\]
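To tie the pieces together numerically, here is a minimal Python sketch with a hypothetical 3-class true label distribution P (the "truth") and a hypothetical model prediction Q, checking that cross-entropy = entropy + KL.

```python
import math

def entropy(p):
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)

def kl(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical 3-class example: true label distribution vs. the model's prediction.
p_true = [0.7, 0.2, 0.1]
q_pred = [0.5, 0.3, 0.2]

print(entropy(p_true))                        # ~0.802 nats: the truth
print(cross_entropy(p_true, q_pred))          # ~0.887 nats: you trying to predict the truth
print(entropy(p_true) + kl(p_true, q_pred))   # same ~0.887: H(P,Q) = H(P) + KL(P||Q)
```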
Physics and Information Theory Relation
From Wikipedia -
Relationship to thermodynamic entropy
The inspiration for adopting the word entropy in information theory came from the close resemblance between Shannon’s formula and very similar known formulae from statistical mechanics.
In statistical thermodynamics the most general formula for the thermodynamic entropy S of a thermodynamic system is the Gibbs entropy.
\[ S = -k_{\text{B}} \sum_{i} p_{i} \ln p_{i} \]
where kB is the Boltzmann constant, and pi is the probability of a microstate. The Gibbs entropy was defined by J. Willard Gibbs in 1878 after earlier work by Ludwig Boltzmann (1872).
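In fact the resemblance is exact up to a constant: with microstate probabilities \( p_i \), the Gibbs entropy is just the Shannon entropy measured in nats and scaled by \( k_B \) (or, in base-2 terms, scaled by \( k_B \ln 2 \)):
\[ S = -k_{\text{B}} \sum_i p_i \ln p_i = (k_{\text{B}} \ln 2) \left( -\sum_i p_i \log_2 p_i \right) = (k_{\text{B}} \ln 2)\, H \]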