Notes on: Nguyen, H. C., Zecchina, R., & Berg, J. (2017): Inverse statistical problems: from the inverse Ising problem to data science

Krister notes

  • Lacoperone

Notation

  • nguyen17_inver_statis_probl_3cd6585c90a8675bffac04901b83b7c623fc740a.png denotes spin of i-th particle
  • nguyen17_inver_statis_probl_90dfb24f0ee0bffedb603d82ef7506cc14f393a4.png coupling between i-th and j-th particle
  • nguyen17_inver_statis_probl_dd5f6d1b0d4c10c780ce2458830ca847ae491d92.png refer to external local fields affecting the i-th particle
  • The Hamiltonian

    nguyen17_inver_statis_probl_75e4367f1bd4a000eb4f5a6af99706dc3cc775f3.png

  • nguyen17_inver_statis_probl_f2eebbd62e41be9d5a2457b4bd291dd096896884.png denotes a random spin variable
  • nguyen17_inver_statis_probl_9493b9776211ef51195c1379ad5f5540a04e566e.png denotes a realisation of nguyen17_inver_statis_probl_f2eebbd62e41be9d5a2457b4bd291dd096896884.png
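For reference, the pairwise Ising Hamiltonian is conventionally written as below. The symbols $s_i$, $J_{ij}$, $h_i$ and the number of spins $N$ are assumptions following common usage and may differ from the ones in the images above.

```latex
% Assumed notation: s_i = \pm 1 is the spin of particle i,
% J_{ij} the coupling between particles i and j, h_i the local field on particle i.
H_{J,h}(\mathbf{s}) = -\sum_{i<j} J_{ij}\, s_i s_j \;-\; \sum_{i=1}^{N} h_i\, s_i ,
\qquad s_i \in \{-1, +1\}
```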

Inverse Ising Model problem

  • Goal: determine couplings nguyen17_inver_statis_probl_90dfb24f0ee0bffedb603d82ef7506cc14f393a4.png and local fields nguyen17_inver_statis_probl_3807b8495f01004f1d88370e2f4fad2f5032db06.png, given a set of nguyen17_inver_statis_probl_f75d7681f9d0acecef24eb637ee201aa5b39199a.png observed spin configurations

nguyen17_inver_statis_probl_544ecbd629a3bfa213ed86a506cb63a32ab3de58.png

is the Boltzmann equilibrium distribution, where we have "subsumed" the temperature into the couplings and fields.
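For a handful of spins the Boltzmann distribution can be evaluated exactly by enumerating all 2^N configurations. A minimal sketch, assuming the conventional pairwise Hamiltonian with temperature absorbed into the parameters; the names `energy` and `boltzmann` and the 3-spin values of `J` and `h` are hypothetical, chosen only for illustration.

```python
import itertools
import math

def energy(s, J, h):
    """Ising energy H(s) = -sum_{i<j} J[i][j] s_i s_j - sum_i h[i] s_i (assumed form)."""
    n = len(s)
    pair = sum(J[i][j] * s[i] * s[j] for i in range(n) for j in range(i + 1, n))
    field = sum(h[i] * s[i] for i in range(n))
    return -pair - field

def boltzmann(J, h):
    """Return all 2^N states and their probabilities p(s) = exp(-H(s)) / Z."""
    n = len(h)
    states = list(itertools.product([-1, 1], repeat=n))
    weights = [math.exp(-energy(s, J, h)) for s in states]
    Z = sum(weights)  # partition function
    return states, [w / Z for w in weights]

# Hypothetical 3-spin example (values are made up)
J = [[0.0, 0.5, -0.3],
     [0.5, 0.0, 0.2],
     [-0.3, 0.2, 0.0]]
h = [0.1, -0.2, 0.0]
states, p = boltzmann(J, h)
```

The cost grows as 2^N, so this is only useful for checking other methods on very small systems.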

This distribution also has the property that it maximizes the (Gibbs) entropy

nguyen17_inver_statis_probl_8f1a960a470a856ce759b3a711ac1912cb0d6e57.png

under the constraint that nguyen17_inver_statis_probl_987b6566628e950eb667da728dc762c1408d38f0.png is normalized and has particular first and second moments, that is, magnetisations and correlations.
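The step from these constraints to the Boltzmann form can be sketched with Lagrange multipliers. The notation is assumed for illustration: $p(\mathbf{s})$ for the distribution, $m_i$ and $\chi_{ij}$ for the prescribed magnetisations and correlations, and $\lambda$, $h_i$, $J_{ij}$ for the multipliers.

```latex
% Functional to be maximised: entropy plus one multiplier per constraint
\mathcal{L}[p] = -\sum_{\mathbf{s}} p(\mathbf{s}) \ln p(\mathbf{s})
  + \lambda \Big( \sum_{\mathbf{s}} p(\mathbf{s}) - 1 \Big)
  + \sum_i h_i \Big( \sum_{\mathbf{s}} p(\mathbf{s})\, s_i - m_i \Big)
  + \sum_{i<j} J_{ij} \Big( \sum_{\mathbf{s}} p(\mathbf{s})\, s_i s_j - \chi_{ij} \Big)

% Stationarity with respect to p(s)
\frac{\partial \mathcal{L}}{\partial p(\mathbf{s})}
  = -\ln p(\mathbf{s}) - 1 + \lambda + \sum_i h_i s_i + \sum_{i<j} J_{ij} s_i s_j = 0
\quad\Longrightarrow\quad
p(\mathbf{s}) = \frac{1}{Z} \exp\Big( \sum_i h_i s_i + \sum_{i<j} J_{ij} s_i s_j \Big),
\qquad Z = e^{1-\lambda}
```

In this reading the fields and couplings are the Lagrange multipliers conjugate to the first and second moments.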

The inverse Ising problem is the determination of the couplings nguyen17_inver_statis_probl_90dfb24f0ee0bffedb603d82ef7506cc14f393a4.png and local fields nguyen17_inver_statis_probl_dd5f6d1b0d4c10c780ce2458830ca847ae491d92.png, given a set of nguyen17_inver_statis_probl_f75d7681f9d0acecef24eb637ee201aa5b39199a.png observed spin configurations nguyen17_inver_statis_probl_c5d646700b33dabb5ee19d34cc8de0585cddc91c.png.
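One way to make the inverse problem concrete for very small systems is maximum likelihood by gradient ascent: the gradient of the log-likelihood with respect to a field or coupling is the difference between the corresponding data moment and model moment. The sketch below is not taken from the paper; the function names `model_moments`, `data_moments`, `fit_ising`, the step size `lr` and the iteration count `steps` are all assumptions. Model moments are computed by exact enumeration, so it only works for a few spins.

```python
import itertools
import math

def model_moments(J, h):
    """Exact <s_i> and <s_i s_j> under p(s) ~ exp(sum_i h_i s_i + sum_{i<j} J_ij s_i s_j)."""
    n = len(h)
    states = list(itertools.product([-1, 1], repeat=n))
    weights = []
    for s in states:
        log_w = sum(h[i] * s[i] for i in range(n))
        log_w += sum(J[i][j] * s[i] * s[j] for i in range(n) for j in range(i + 1, n))
        weights.append(math.exp(log_w))
    Z = sum(weights)
    m = [sum(w * s[i] for w, s in zip(weights, states)) / Z for i in range(n)]
    c = [[sum(w * s[i] * s[j] for w, s in zip(weights, states)) / Z
          for j in range(n)] for i in range(n)]
    return m, c

def data_moments(samples):
    """Empirical magnetisations and pair correlations of the observed configurations."""
    n, M = len(samples[0]), len(samples)
    m = [sum(s[i] for s in samples) / M for i in range(n)]
    c = [[sum(s[i] * s[j] for s in samples) / M for j in range(n)] for i in range(n)]
    return m, c

def fit_ising(samples, steps=2000, lr=0.1):
    """Gradient ascent on the log-likelihood; the update is (data moment - model moment)."""
    n = len(samples[0])
    m_data, c_data = data_moments(samples)
    h = [0.0] * n
    J = [[0.0] * n for _ in range(n)]
    for _ in range(steps):
        m_mod, c_mod = model_moments(J, h)
        for i in range(n):
            h[i] += lr * (m_data[i] - m_mod[i])
            for j in range(i + 1, n):
                J[i][j] += lr * (c_data[i][j] - c_mod[i][j])
                J[j][i] = J[i][j]  # keep the coupling matrix symmetric
    return J, h

# Hypothetical data set of M = 10 configurations of two spins
samples = [(1, 1)] * 4 + [(1, -1)] * 1 + [(-1, 1)] * 2 + [(-1, -1)] * 3
J_fit, h_fit = fit_ising(samples)
```

At convergence the model reproduces the observed magnetisations and correlations. If some configuration never appears in the data, the optimal parameters may diverge, so this naive version needs regularisation in practice.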

Applications

There are two distinct settings of the inverse Ising problem: equilibrium and non-equilibrium.

  • Reconstruction of neural and genetic networks
  • Determination of three-dimensional protein structures
  • Fitness landscape (quantifies the average reproductive success of an organism with particular genotype, i.e. a particular DNA sequence)
  • Bacterial responses to combinations of antibiotics
  • Flocking dynamics

Gene networks

  • Proteins are macromolecules consisting of long chains of amino acids.
  • The particular sequence of a protein is encoded in DNA, a double-stranded helix of complementary nucleotides.
  • Specific parts of DNA, the genes, are transcribed by polymerases, producing a single-stranded copy called m(essenger)RNA, which is translated by ribosomes, usually multiple times, to produce proteins.
  • The process of producing protein molecules from the DNA template by transcription and translation is called gene expression.
  • Expression of a gene is tightly controlled to ensure that the right amounts of proteins are produced at the right time.
  • Transcription factors are proteins which affect the expression of a gene (or several) by binding to DNA near the transcription start site of that gene (the regulatory region of a gene). Transcription factors open up the DNA allowing a gene to be expressed.

Transcription factors can also bind to the DNA near the transcription start site of a gene to repress it, for example by preventing the polymerase from binding there, so that the gene is not transcribed. Both effects help regulate the expression of a gene.

  • An important control mechanism is transcription factors: proteins which affect the expression of a gene (or several) by binding to DNA near the transcription start site of that gene; this part is called the regulatory region of a gene.
  • A target gene of a transcription factor may in turn encode another transcription factor, leading to a cascade of regulatory events.
  • A further complication: binding of multiple transcription factors in the regulatory region of a gene leads to combinatorial effects on the expression of that gene.
  • This poses the question: can the regulatory connections between genes be inferred from data on gene expression, that is, can we learn the identity of transcription factors and their targets?

Simultaneous measurement of expression levels of all genes

  • Measuring mRNA levels allows determination of the regulation of the transcription process
  • With recent advances (microarrays, and reverse transcription followed by high-throughput sequencing), measuring mRNA levels has become routine
  • Regulation of translation has been comparatively neglected, because data on the regulation of transcription are much easier to obtain (as noted above)
  • Microarrays
    • Thousands of short DNA sequences, called probes, grafted to the surface of a small chip
    • After converting mRNA to DNA by reverse transcription, cleaving that DNA into short segments, and fluorescently labelling the resulting segments, the labelled DNA binds to its complementary probe sequences on the chip
    • Reverse transcription converts mRNA to DNA, a process which requires a so-called reverse transcriptase as an enzyme
    • Amount of fluorescent DNA bound to a particular probe depends on the amount of mRNA originally present in the sample
    • Relative amount of mRNA from a particular gene can then be inferred from the fluorescence signal at the corresponding probes
    • Limitation: large amount of mRNA required
      • mRNA sample is taken from a population of cells
      • Cell-to-cell fluctuations of mRNA concentrations are averaged over
    • To obtain time series, populations of cells synchronized to approximately the same stage in the cell cycle are used
  • Reverse transcription of mRNA followed by high-throughput sequencing of DNA segments
    1. Reverse transcription of mRNA
    2. High-throughput sequencing of the resulting DNA segments
    3. Relative mRNA levels follow directly from counts of sequence reads
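The last step can be illustrated with a toy calculation: relative mRNA levels as the fraction of reads assigned to each gene. This is a deliberately simplified sketch; the function name `relative_expression` and the gene names and counts are hypothetical, and real pipelines usually also correct for effects such as gene length, which is ignored here.

```python
def relative_expression(read_counts):
    """Convert raw read counts per gene into fractions of the total read count."""
    total = sum(read_counts.values())
    if total == 0:
        raise ValueError("no reads in sample")
    return {gene: count / total for gene, count in read_counts.items()}

# Hypothetical read counts for three genes
counts = {"geneA": 120, "geneB": 30, "geneC": 50}
levels = relative_expression(counts)
```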

Equilibrium reconstruction

Maximum likelihood

Maximum entropy modelling

Notation

  • nguyen17_inver_statis_probl_f75d7681f9d0acecef24eb637ee201aa5b39199a.png balls
  • nguyen17_inver_statis_probl_6dd4ec21f92a21e77e43374b4c7fd1cc8c2234c6.png compartments
  • nguyen17_inver_statis_probl_39be0a62593fe24519aa74c32e53cdbee2e6a999.png denotes number of balls in the r-th compartment
  • nguyen17_inver_statis_probl_7bb2a191dae12240100b1a15f30f3b089b2e4994.png fraction of nguyen17_inver_statis_probl_f75d7681f9d0acecef24eb637ee201aa5b39199a.png balls in r-th compartment
  • nguyen17_inver_statis_probl_fe7108f38f2db55db9cb7eee3f63077437a2e510.png denotes the number of possible arrangements

Derivation

nguyen17_inver_statis_probl_024849562041f21a94f5a691705001bb60e29c0e.png

with

nguyen17_inver_statis_probl_d01ea88e9d3852e6b4356993ed96824f2a9d3231.png

For large nguyen17_inver_statis_probl_f75d7681f9d0acecef24eb637ee201aa5b39199a.png, we write nguyen17_inver_statis_probl_a0af44a0f532b6d8d5f677a300db55be0cf8aa32.png and exploit Stirling's formula:

nguyen17_inver_statis_probl_b20f55e3e66c22ba1b26ebb8b9a1f424cd66ae42.png

yielding the Gibbs entropy

nguyen17_inver_statis_probl_9eb6ec07b46f30231f52588ac2b0e6a8add44af2.png
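The large-number limit can be checked numerically. The sketch below assumes the number of arrangements is the multinomial coefficient W = M! / prod_r (M_r!), with M balls in total and M_r balls in compartment r; these symbols, the function names and the example fractions are assumptions for illustration. It compares (1/M) ln W with the Gibbs entropy -sum_r p_r ln p_r.

```python
import math

def log_multiplicity(counts):
    """ln W for W = M! / prod_r (M_r!), using lgamma(n + 1) = ln(n!) to avoid overflow."""
    M = sum(counts)
    return math.lgamma(M + 1) - sum(math.lgamma(c + 1) for c in counts)

def gibbs_entropy(fractions):
    """S = -sum_r p_r ln p_r, skipping empty compartments."""
    return -sum(p * math.log(p) for p in fractions if p > 0)

fractions = [0.5, 0.3, 0.2]  # hypothetical occupation fractions
errors = {}
for M in (10, 1000, 100000):
    counts = [round(p * M) for p in fractions]
    errors[M] = abs(log_multiplicity(counts) / M - gibbs_entropy(fractions))
    print(M, errors[M])
```

The discrepancy shrinks as M grows, which is the content of the Stirling approximation used above.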

If we assume that each state / arrangement with a given total "energy" is equally likely, the statistics of nguyen17_inver_statis_probl_7bb2a191dae12240100b1a15f30f3b089b2e4994.png is dominated by a sharp maximum of nguyen17_inver_statis_probl_fe7108f38f2db55db9cb7eee3f63077437a2e510.png as a function of the nguyen17_inver_statis_probl_7bb2a191dae12240100b1a15f30f3b089b2e4994.png, subject to the constraints

nguyen17_inver_statis_probl_cf4fc2fc4572cc09526d04173278a0b7f56589ff.png

Using Lagrange multipliers to maximize the Gibbs entropy subject to the above constraints yields the Boltzmann distribution.
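A sketch of that calculation, assuming the constraints are normalization and a fixed mean energy. The symbols $p_r$, $E_r$, $E$ and the multipliers $\alpha$, $\beta$ are assumed notation, not necessarily that of the images above.

```latex
% Entropy with multipliers for normalisation (\alpha) and mean energy (\beta)
\mathcal{L} = -\sum_r p_r \ln p_r
  - \alpha \Big( \sum_r p_r - 1 \Big)
  - \beta \Big( \sum_r p_r E_r - E \Big)

% Stationarity with respect to each p_r
\frac{\partial \mathcal{L}}{\partial p_r} = -\ln p_r - 1 - \alpha - \beta E_r = 0
\quad\Longrightarrow\quad
p_r = \frac{e^{-\beta E_r}}{Z},
\qquad Z = \sum_r e^{-\beta E_r} = e^{1+\alpha}
```

The multiplier $\alpha$ is fixed by normalization and $\beta$ by the energy constraint; $\beta$ plays the role of the inverse temperature.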