The following

content is provided under a Creative

Commons license. Your support will help MIT

OpenCourseWare continue to offer high quality

educational resources for free. To make a donation or

view additional materials from hundreds of MIT courses,

visit MIT OpenCourseWare at ocw.mit.edu. PROFESSOR: Going to

finish up a little bit from last time on gene

regulatory networks and see how the different

methods that we looked at compared, and then we’ll

dive into protein interaction networks. Were there any questions

from last time? OK. Very good. So recall that we start off with

this dream challenge in which they provided unlabeled data

representing gene expression data for either in a completely

synthetic case, in silico data, or for three different

actual experiments– one in E. coli, one in S.

cerevisiae, and one in S. aureus. For some of those, it was

straight expression data under different conditions. In other cases, there were

actual knock-down experiments or other kinds of perturbations. And then they gave that

data out to the community and asked people to use

whatever methods they wanted to try to rediscover

automatically the gene regulatory networks. So with some

preliminary analysis, we saw that there were a couple

of main clusters of kinds of analyses that all had similar

properties across these data sets. There were the

Bayesian networks, that we’ve discussed now

in two separate contexts. And then we looked at

regression-based techniques and mutual information

based techniques. And there were a bunch of

other kinds of approaches. And some of them actually

combine multiple predictors from different kinds

of algorithms together. And some of them, they

evaluated how well each of these did on all the

different data sets. So first the results

on the in silico data, and they’re showing

this as an area under the

precision-recall curve. Obviously, higher numbers

are going to be better here. So in this first
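As an aside, the evaluation metric here can be computed directly from a ranked edge list. A minimal sketch, with invented edge names and a toy gold standard:

```python
# Sketch: area under the precision-recall curve (AUPR) for a ranked
# list of predicted regulatory edges. Edge names and scores are
# invented for illustration.

def aupr(ranked_edges, true_edges):
    """ranked_edges: edges sorted by descending confidence."""
    tp = 0
    points = []  # (recall, precision) after each cutoff k
    for k, edge in enumerate(ranked_edges, start=1):
        if edge in true_edges:
            tp += 1
        points.append((tp / len(true_edges), tp / k))
    # step-wise integration of precision over recall
    area, prev_recall = 0.0, 0.0
    for recall, precision in points:
        area += (recall - prev_recall) * precision
        prev_recall = recall
    return area

gold = {("tfA", "gene1"), ("tfB", "gene2")}
predictions = [("tfA", "gene1"), ("tfA", "gene3"), ("tfB", "gene2")]
print(round(aupr(predictions, gold), 3))  # → 0.833
```

This is the step-wise form of average precision: recall only moves at true positives, so the area is the mean of the precision values at the hits.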

group over here are the regression-based

techniques, mutual information, correlation, Bayesian networks. "Other" covers approaches that didn't fall into any of

those particular categories. "Meta" were techniques that

use more than one class of prediction and then develop

their own prediction based on those individual techniques. Then they defined something

that they call the community definition, which

they combine data from many of the

different techniques together with their own

algorithms to kind of come up with what they call the

“wisdom of the crowds.” And then R represents

a random collection of other predictions. And you can see that on

these in silico data, the performances

don’t dramatically differ one from the other. Within each class, if you

look at the best performer in each class, they’re all

sort of in the same league. Obviously, some of the classes

do better consistently. Now their point

in their analysis is about the wisdom

of the crowds, that taking all these data

together, even including some of the bad

ones, is beneficial. That’s not the main

thing that I wanted to get out of these

data for our purposes. Notice, though, that the area under

the curve is about 30-something percent– sorry,

this is the in silico data. Now this is the first

real experimental data we’ll look at, so

this is E. coli data. And notice the change of scale,

that the best performers are achieving less than 10% of

the optimal possible result. So you can see that the real

data are much, much harder than the in silico data. And here the performance

varies quite a lot. You can see that the Bayesian

networks are struggling, compared to some of

the other techniques. The best of those

doesn’t really get close to the best of some

of these other approaches. So what they did next, was they

took some of the predictions from their community

predictions that were built off of all these other data, and

they went and actually tested some of these. So they built regulatory

networks for E. coli and for S. aureus. And then they actually did

some experiments to test them. I think the results

overall are kind of encouraging, in the

sense that if you focus on the top pie chart

here, of all the things that they tested,

about half of them, they could get some support. In some cases, it was

very strong support. In other cases, it

wasn’t quite as good. So the glass is half

empty or half full. But also, one of the

interesting things is that the data

are quite variable over the different

predictions that they make. So each one of these circles

represents a regulator, and the things that they claim

are targets of that regulator. And things that are

in blue are things that were confirmed

by their experiments. The things with black outlines

and blue are the controls. So they knew that

these would be right. So you could see that for

PurR, they do very well. For some of these

others, they do mediocre. But there are some, which

they’re honest enough to admit, they do very poorly on. So they didn’t get any

of their predictions right for this regulator. And this probably reflects the

kind of data that they had, in terms of what conditions

were being tested. So, so far, things

look reasonable. I think the real

shocker of this paper does not appear in the

abstract or the title. But it is in one of the main

figures, if you pay attention. So these were the results

for in silico data. Everything looked pretty good. Change of scale to E. coli,

there’s some variation. But you can make arguments. These are the results for

Saccharomyces cerevisiae. So this is the organism,

yeast, on which most of the gene

regulatory algorithms were originally developed. And people actually

built careers off of saying how great

their algorithms were in reconstructing these

regulatory networks. And we look at these

completely blinded data, where people don’t know

what they’re looking for. You could see that the actual

results are rather terrible. So the area under the curve

is in the single digits of percentage. And it doesn’t seem to matter

what algorithm they’re using. They’re all doing very badly. And the community predictions

are no better– in some cases, worse– than the

individual ones. So this is really

a stunning result. It’s there in the data. And if you dig into

the supplement, they actually explain

what’s going on, I think, pretty clearly. Remember that all

of these predictions are being made by looking for a

transcriptional regulator that increases in its own

expression or decreases in its own expression. And that change in

its own expression is predictive of its targets. So the hypothesis is when you

have more of an activator, you’ll have more of

its targets coming on. If you have less

of an activator, you’ll have less of the targets. And you look through

all the data, whether it’s by Bayesian

networks or regression, to find those kinds

of relationships. Now what if those

relationships don’t actually exist in the data? And that’s what

this chart shows. So the green are genes

that have no relationship with each other. And they’re measuring here the

correlation across all the data sets, between

pairs of genes that have no known

regulatory relationship. The purple are ones

that are targets of the same

transcription factor. And the orange

are ones where one is the activator or

repressor of the other. And in the in silico data,

they give a very nice spread between the green, the

orange, and the purple. So the co-regulated genes are

very highly correlated with each other. The ones that are parent-child

relationships– a regulator and its target– have a

pretty good correlation, much, much different

from the distribution that you see for the things

that are not interacting. And on these data, the

algorithms do their best. Then you look at

the E. coli data, and you can see that in

E. coli, the curves are much closer to each other,

but there’s still some spread. But when you look

at yeast– again, this is where a lot

of these algorithms were developed– you

could see there’s almost no difference between the

correlation between the things that have no relationship

to each other, things that are co-regulated

by the same regulatory protein, or those parent-child

relationships. They’re all quite similar. And it doesn’t

matter whether you use correlation analysis

or mutual information. Over here and in this

right-hand panel, they’ve blown up the

bottom part of this curve, and you can see how

similar these are. So again, this is a

mutual information spread for in silico data for

E. coli and then for yeast. OK. So what I think we can say
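For reference, the two pairwise statistics being compared here, correlation and (binned) mutual information, can be sketched as follows; the toy expression profiles are invented:

```python
# Sketch: Pearson correlation and a simple binned mutual information
# for a pair of expression profiles. The toy profiles are invented.
import math

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def mutual_information(x, y, bins=2):
    # discretize each profile into equal-width bins, then sum
    # p(a,b) * log2(p(a,b) / (p(a) p(b))) over the joint histogram
    def digitize(v):
        lo, hi = min(v), max(v)
        w = (hi - lo) / bins or 1.0
        return [min(int((val - lo) / w), bins - 1) for val in v]
    xs, ys = digitize(x), digitize(y)
    n = len(x)
    mi = 0.0
    for a in range(bins):
        for b in range(bins):
            pab = sum(1 for i in range(n) if xs[i] == a and ys[i] == b) / n
            pa, pb = xs.count(a) / n, ys.count(b) / n
            if pab > 0:
                mi += pab * math.log2(pab / (pa * pb))
    return mi

regulator = [0.1, 0.2, 0.8, 0.9]
target    = [0.0, 0.1, 0.9, 1.0]   # tracks the regulator closely
print(round(pearson(regulator, target), 3),
      round(mutual_information(regulator, target), 3))
```

The point of the yeast result is that for real regulator-target pairs, neither of these statistics separates from the background distribution.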

about the expression analysis is that expression data

are very, very powerful for some things and

are going to be rather poor for some

other applications. So they’re very powerful for

classification and clustering. We saw that earlier. Now what those

clusters mean, that’s this inference problem

they’re trying to solve now. And the expression data are

not sufficient to figure out what the regulatory proteins

are that are causing those sets of genes to be

co-expressed– at least not in yeast. And I think there’s

every expectation that if you did the

same thing in humans, you would have the same result. So the critical

question then is if you do want to build models of

how regulation is taking place in organisms, what do you do? And the answer is that you

need some other kind of data. So one thing you might

think, if we go back to this core analysis,

like what’s wrong? Why is it that these gene

expression levels cannot be used to predict the

regulatory networks? And it comes down to

whether mRNA levels are predictive of protein levels. And a couple of groups

have looked into this. One of the earlier studies

was this one, now 2009, where they used microarray data

and looked at mRNA expression levels versus protein levels. And what do you see in this? You see that there is a trend. Right there, R

squared is around 0.2, but that there’s a huge spread. So that for any

position on the x-axis, a particular level of mRNA, you

can have 1,000-fold variation in the protein levels. So a lot of people

saw this and said, well, we know there are

problems with microarrays. They’re not really great

at predicting mRNA levels or low in protein levels. So maybe this will all get

better if we use mRNA-Seq. Now that turns out

not to be the case. So there was a very careful

study published in 2012, where the group used

microarray data, RNA-Seq data, and a number of different ways

of calling the proteomics data. So you might say, well,

maybe some of the problem is that you’re not doing a very

good job of inferring protein levels from mass spec data. And so they try a whole

bunch of these different ways of pulling mass spec data. And then they look,

you should focus on the numbers in these

columns for the average and the best correlations

between the RNA data in these columns and the

proteomic data in the rows. And you could see the

best case scenario– you can get these up to 0.54

correlation, still pretty weak. So what’s going on? What we’ve been focusing

on now is the idea that the RNA levels are going

to be very well correlated with protein levels. And I think a lot

of literature is based on hypotheses that

are almost identical. But in reality, of

course, there are a lot of processes involved. There’s the process

of translation, which has a rate

associated with it. It has regulatory steps

associated with it. And then there are

degradatory pathways. So the RNA gets

degraded at some rate, and the protein gets

degraded at some rate. And sometimes those

rates are regulated, sometimes they’re not. Sometimes it depends

on the sequence. So what would happen
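The processes just described, transcription, translation, and the two degradation steps, are usually written as a pair of coupled rate equations. A minimal sketch, with invented rate constants:

```python
# Sketch of the standard two-equation model relating mRNA and protein:
#   dR/dt = v_sr - k_dr * R        (transcription, mRNA degradation)
#   dP/dt = k_sp * R - k_dp * P    (translation, protein degradation)
# The rate constants below are invented for illustration.

def simulate(v_sr, k_dr, k_sp, k_dp, t_end=200.0, dt=0.01):
    R, P = 0.0, 0.0   # start with no mRNA, no protein
    t = 0.0
    while t < t_end:
        R += (v_sr - k_dr * R) * dt   # forward-Euler step for mRNA
        P += (k_sp * R - k_dp * P) * dt  # and for protein
        t += dt
    return R, P

# At steady state, R* = v_sr / k_dr and P* = k_sp * R* / k_dp, so the
# protein level depends on translation and degradation rates, not on
# the mRNA level alone.
R, P = simulate(v_sr=2.0, k_dr=0.5, k_sp=10.0, k_dp=0.1)
print(R, P)
```

With these numbers the steady state is R* = 4 and P* = 400; changing only the protein half life (k_dp) moves P* without moving R*, which is exactly why mRNA alone is a weak predictor.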

if you actually measured what’s going on? And that was done recently

in this paper in 2011, where the group used a

labeling technique for proteins to [INAUDIBLE] and measure

steady state levels of proteins and then label the

proteins at specific times and see how much

newly synthesized protein there was

at various times. And similarly, for RNA, using

a technology that allowed them to separate newly synthesized

transcripts from the bulk RNA. And once you have

those data, then you can find out what the spread is

in the half lives of proteins and the abundance of proteins. So if you focus on

the left-hand side, these are the

determined half lives for various RNAs in blue

and proteins in red. If you look at the

spread in the red ones, you’ve got at least three

orders of magnitude of range in stability in half

lives for proteins. So that’s really at the

heart of why RNA levels are very poorly predictive

approaching levels, because there’s such a range

in the stability of proteins. And the RNAs also, they

spread over probably about one or two orders of

magnitude in the RNA stability. And then here are

the abundances. So you can see that

the range of abundance for average copies

per cell of proteins is extremely large,

from 100 to 10 to the eighth copies per cell. Now if you look at the

degradation rates for protein half lives and RNA

half lives, you can see there’s no correlation. So these are completely

independent processes that determine whether an

RNA is degraded or a protein is degraded. So then when you try to figure

out what the relationship is between RNA levels

and protein levels, you really have to resort to a

set of differential equations to map out what

all the rates are. And if you know all

those rates, then you can estimate what the

relationships will be. And so they did exactly that. And these charts show

what they inferred to be the contribution of

each of these components to protein levels. So on the left-hand

side, these are from cells which

had the most data. And they build a model

on the same cells from which they

collected the data. And in these cells, the RNA

levels account for about 40% of the protein

levels, the variance. And the biggest

thing that affects the abundance of proteins

is rates of translation. And then they took the data

built from one set of cells and tried to use it

to predict outcomes in another set of

cells in replicate. And the results are

kind of similar. They also did it for an entirely

different kind of cell types. In all of these cases,

the precise amounts are going to vary. But you can see that

the red bars, which represent the amount of

information content in the RNA, is less than about half of what

you can get from other sources. So this gets back

to why it’s so hard to infer regulatory networks

solely from RNA levels. So this is the

plot that they get when they compare

protein levels and RNA levels at the

experimental level. And again, you see

that big spread and R squared at about 0.4,

which at the time, they were very proud of. They write several

times in the article, this is the best anyone

has seen to date. But if you incorporate all these

other pieces of information about RNA stability

and protein stability, you can actually get a

very, very good correlation. So once you know the variation

in the protein stability and the RNA stability for each

and every protein and RNA, then you can do a good job

of predicting protein levels from RNA levels. But without all that

data, you can’t. Any questions on this? So what are we going to do then? So we really have two primary

things that we can do. We can try to explicitly model

all of these regulatory steps and include those in

our predictive models and try to build up gene

regulatory networks, protein models that actually include all

those different kinds of data. And we’ll see that

in just a minute. And the other thing we

can try to do is actually, rather than try

to focus on what’s downstream of RNA synthesis,

the protein levels, we can try to focus on what’s

upstream of RNA synthesis and look at what the production

of RNAs– which RNAs are getting turned on

and off– tell us about the signaling pathways

and the transcription factors. And that’s going to be a

topic of one of the upcoming lectures in which Professor

Gifford will look at variations in epigenomic data and

using those variations in epigenomic data to identify

sequences that represent which regulatory proteins are bound

under certain conditions and not others. Questions? Yeah? AUDIENCE: In a

typical experiment, the rate constants for how

many mRNAs or proteins can be estimated? PROFESSOR: So the question was

how many rate constants can you estimate in a

typical experiment? So I should say,

first of all, they’re not typical experiments. Very few people do

this kind of analysis. It’s actually very time

consuming, very expensive. So I think in this one, it was–

I may get the numbers slightly wrong– but it was thousands. It was some decent fraction

of the proteome, but not the entire one. But most of the data

sets in the papers you'll read do not include any analysis of

stability rates, degradation rates. They only look at the bulk

abundance of the RNAs. Other questions? OK. So this is an

upcoming lecture where we’re going to actually

try to go backwards. We’re going to say, we

see these changes in RNA. What does that

tell us about what regulatory regions of the

genome were active or not? And then you could

go upstream from that and try to figure out

the signaling pathways. So if I know

changes in RNA, I’ll deduce, as we’ll see in

that upcoming lecture– the sequences– the identity

of the DNA binding proteins. And then I could

try to figure out what the signaling

pathways were that drove those changes in

gene expression. Now later in this lecture, we’ll

talk about the network modeling problem. If assuming you knew these

transcription factors, what could you do to

infer this network? But before we go

to that, I’d like to talk about an interesting

modeling approach that tries to take into account

all these degradatory pathways and look specifically at

each kind of regulation as an explicit step in

the model and see how that copes with some of these issues. So this is work

from Josh Stuart. And one of the first

papers is here. We’ll look at some

later ones as well. And the idea here is to

explicitly, as I said, deal with many, many

different steps in regulation and try to be quite specific

about what kinds of data are informing about what

step in the process. So we measure the

things in the bottom here– arrays that tell us

how many copies of a gene there are in the genome,

especially in cancer. And you can get

big changes of what are called copy number,

amplifications, or deletions of large chunks of chromosomes. You need to take

that into account. All the RNA-Seq and microarrays

that we were talking about in measuring

transcription levels– what do they actually tell us? Well, they give us

some information about what they’re

directly connected to. So the transcriptomic data tells

something about the expression state. But notice they have explicitly

separated the expression state of the RNA from

the protein level. And they separated

the protein level from the protein activity. And they have these

little black boxes in here that represent the different

kinds of regulations. So however many copies of a

gene you have in the genome, there’s some regulatory event,

transcriptional regulation, that determines how

much expression you get at the mRNA level. There’s another

regulatory event here that determines at what

rate those RNAs are turned into proteins. And there are other

regulatory steps here that have to do with

signaling pathways, for example, that determine

whether those proteins are active or not. So we’re going to treat each

of those as separate variables in our model that

are going to be connected by these black boxes. So they call their

algorithm PARADIGM, and they developed

it in the context of looking at cancer data. In cancer data, the two primary

kinds of information they had were the RNA levels from either

microarray or RNA-Seq and then these copy number

variations, again, representing

amplifications or deletions of chunks of the genome. And what they’re trying

to infer from that is how active different

components are of known signaling pathways. Now the approach that

they used that involved all of those little

black boxes is something called a factor graph. And factor graphs can be

thought of in the same context as Bayesian networks. In fact, Bayesian networks

are a type of factor graph. So if I have a

Bayesian network that represents these three

variables, where they’re directly connected by

edges, in a factor graph, there would be this extra

kind of node– this black box or red box– that’s the factor

that’s going to connect them. So what do these things do? Well, again, they’re

bipartite graphs. They always have these

two different kinds of nodes– the random

variables and the factors. And the reason they’re

called factor graphs is they describe how the global

function– in our case, it’s going to be the global

probability distribution– can be broken down into

factorable components. It can be combined

in a product to look at what the global

probability function is. So if I have some

global function over all the variables, you can think

of this again, specifically, as the probability function–

the joint probability for all the variables in my system–

I want to be able to divide it into a product of

individual terms, where I don’t have all the

variables in each of these f’s. They’re just some

subset of variables. And each of these represents

one of these terms in that global product. The only things that

are in this function are things to which

it’s directly connected. So these edges exist

solely between a factor and the variables that are

terms in that equation. Is that clear? So in this context,

the variables are going to be nodes. And their allowed

values are going to be whether they’re

activated or not activated. The factors are going to

describe the relationships among those variables. We previously saw those as

being cases of regulation. Is the RNA turned into protein? Is the protein activated? And what we’d like

to be able do is compute marginal probabilities. So we’ve got some

big network that represents our understanding

of all the signaling pathways and all the transcriptional

regulatory networks in a cancer cell. And we want to ask about

a particular pathway or a particular protein,

what’s the probability that this protein or this

pathway is activated, marginalized over all

the other variables? So that’s our goal. Our goal is to find

a way to compute these marginal

probabilities efficiently. And how do you

compute a marginal? Well, obviously you need to

sum over all the configurations of all the variables that

have your particular variable at its value. So if I want to know if

MYC and MAX are active, I set MYC and MAX

equal to active. And then I sum over

all the configurations that are consistent with that. And in general, that

would be hard to do. But the factor graph

gives us an efficient way of figuring out how to do that. I’ll show you in a second. So I have some global function. In this case, this little

factor graph over here, this is the global function. Now remember, these

represent the factors, and they only have

edges to things that are terms in

their equations. So over here, is a

function of x3 and x5. And so it has edges to x3 and

x5, and so on for all of them. And if I want to explicitly

compute the marginal with respect to a

particular variable, so the marginal

with respect to x1 set equal to a, so I’d

have this function with x1 equal to a times the sum over

all possible states of x2, the sum over all possible

states of x3, x4, and x5. Is that clear? That’s just the

definition of a marginal. They introduced a notation in

factor graphs that’s called a “not-sum.” It’s rather terrible, but

the not-sum or summary. So I like this term,

summary, better. The summary over

all the variables. So if I want to figure

out the summary for x1, that’s the sum over

all the other variables of all their

possible states when I set x1 equal to

a, in this case. So it’s purely a definition. So then I can rewrite– and you

can work this through by hand after class– but

I can rewrite this, which is this intuitive way

of thinking of the marginal, in terms of these not-sums,

where each one of these is over all the other

variables that are not the one that’s in the brackets. So that’s just the definition. OK, this hasn’t really
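To make the definition concrete, here is a brute-force marginal for a tiny factor graph; the two factors are invented for illustration:

```python
# Sketch: brute-force marginalization in a tiny factor graph with
# binary variables x1, x2, x3 and two invented factors,
#   fA(x1, x2) and fB(x2, x3), where p(x) is proportional to fA * fB.
from itertools import product

def fA(x1, x2):
    return 2.0 if x1 == x2 else 1.0

def fB(x2, x3):
    return 3.0 if x2 == x3 else 1.0

def marginal_x1(value):
    # the "not-sum"/summary for x1: sum the product of all factors
    # over every configuration of the remaining variables
    total = 0.0
    for x2, x3 in product([0, 1], repeat=2):
        total += fA(value, x2) * fB(x2, x3)
    return total

Z = marginal_x1(0) + marginal_x1(1)   # normalizing constant
print(marginal_x1(0) / Z, marginal_x1(1) / Z)
```

This enumerates every configuration, which is exactly the exponential cost that the factor-graph machinery is designed to avoid.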

helped us very much, if we don’t have

some efficient way of computing these marginals. And that’s what the

factor graph does. So we’ve got some factor graph. We have this

representation, either in terms of graph or equation,

of how the global function can be partitioned. Now if I take any one

of these factor graphs, and I want to compute

a marginal over a node, I can re-draw the factor graph

so that variable of interest is the root node. Right? Everyone see that these

two representations are completely equivalent? I’ve just yanked

x1 up to the top. So now this is a tree structure. So this is that factor

graph that we just saw drawn as a tree. And this is what’s called

an expression tree, which is going to tell

us how to compute the marginal over the

structure of the graph. So this is just copied

from the previous picture. And now we’re going to come up

with a program for computing these marginals, using

this tree structure. So first I’m going to compute

that summary function– the sum over all sets of the

other variables for everything below this point, starting with

the lowest point in the graph. And we can compute the

summary function there. And that’s this term, the

summary for x3 of just this fE. I do the same thing for

fD, the summary for it. And then I go up a

level in the tree, and I multiply the summary

for everything below it. So I’m going to compute

the product of the summary functions. And I always compute the summary

with respect to the parent. So here the parent was

x3, for both of these. So these are summaries

with respect to x3. Here who’s the parent? x1. And so the summary is to x1. Yes? AUDIENCE: Are there

directed edges? In the sense that in f, in

the example on the right, is fD just relating

how x4 relates to x3? PROFESSOR: That’s exactly right. So the edges represent which

factor you’re related to. So that’s why I can

redraw it in any way. I’m always going to

go from the leaves up. I don’t have to worry about any

directed edges in the graph. Other questions? So what this does

is it gives us a way to efficiently, over a

complicated graph structure, compute marginals. And they’re typically

thought of in terms of messages that are being sent

from the bottom of the graph up to the top. And you can have a rule from

computing these marginals. And the rule is as follows. Each vertex waits

for the messages from all of its

children– the messages are

accumulating their way up the graph. And every node is

waiting until it hears from all of its progeny

about what’s going on. And then it sends the signal

up above it to its parent, based on the following rules. A variable node just takes

the product of the children. And a factor node– one of

those little black boxes– computes the summary

for the children and sends that up to the parent. And it’s the summary with

respect to the parent, just like in the

examples before. So this is a formula for

computing single marginals. Now it turns out– I’m not going

to go into details of this. It’s kind of complicated. But you actually can,

based on this core idea, come up with an efficient way of

computing all of the marginals without having to

do this separately for every single one. And that’s called a

message passing algorithm. And if you’re really

interested, you can look into the citation

for how that’s done. So the core idea is that we
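A hand-rolled sketch of the leaves-to-root pass on a small chain (binary variables, invented factors):

```python
# Sketch: leaves-to-root sum-product on a small chain rooted at x1,
#   x3 -- fB -- x2 -- fA -- x1   (binary variables, invented factors).
# A factor node sends the summary over its children with respect to
# its parent; a variable node sends the product of its child messages.

def fA(x1, x2):
    return [[2.0, 1.0], [1.0, 2.0]][x1][x2]

def fB(x2, x3):
    return [[3.0, 1.0], [1.0, 3.0]][x2][x3]

# the leaf variable x3 sends the trivial message [1, 1] to fB
msg_x3_to_fB = [1.0, 1.0]
# factor fB sends to x2 the summary over its child x3
msg_fB_to_x2 = [sum(fB(x2, x3) * msg_x3_to_fB[x3] for x3 in (0, 1))
                for x2 in (0, 1)]
# variable x2 forwards the product of its child messages (one here)
msg_x2_to_fA = msg_fB_to_x2
# factor fA sends to the root x1 the summary over its child x2
msg_fA_to_x1 = [sum(fA(x1, x2) * msg_x2_to_fA[x2] for x2 in (0, 1))
                for x1 in (0, 1)]
print(msg_fA_to_x1)  # unnormalized marginal of x1
```

Each message is only a sum over one variable at a time, which is where the efficiency over brute-force enumeration comes from.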

can take a representation of our belief of how this

global function– in our case, it’s going to be the joint

probability– factors in terms of particular

biological processes. We can encode what we know about

the regulation in that factor graph, the structure

of the graph. And then we could

have an efficient way of computing the marginals,

which will tell us, given the data,

what’s the probability that this particular

pathway is active? So in this particular case,

in this paradigm model, the variables can

take three states– activated, deactivated,

or unchanged. And this is, in a tumor

setting, for example, you might say the tumor is

just like the wild type cell, or the tumor has activation

with respect to the wild type, or it has a repression with

respect to the wild type. Again, this is the structure of

the factor graph that they’re using and the different kinds

of information that they have. The primary experimental

data are just these arrays that tell us about

SNPs and copy number variation and then arrays or RNA-Seq to

tell us about the transcript levels. But now they can

encode all sorts of rather complicated

biological functions in the graph structure itself. So transcription

regulation is shown here. Why is the edge from

activity to here? Because we don’t

want to just infer that if there’s more of the

protein, there’s more activity. So we’re actually,

explicitly computing the activity of each protein. So if an RNA gets

transcribed, it’s because some transcription

factor was active. And the transcription

factor might not be active, even if the levels

of the transcription factor are high. That’s one of the

pieces that’s not encoded in all of those

things that were in the dream challenge, that are really

critical for representing the regulatory structure. Similarly, protein

activation– I can have protein that goes from

being present to being active. So think of a

kinase, that itself needs to be phosphorylated

to be active. So that would be

that transition. Some other kinase comes in. And if that other

kinase1 is active, then it can

phosphorylate kinase2 and make that one active. And so it’s pretty

straightforward. You can also represent the

formation of a complex. So the fact that all the

proteins are in the cell doesn’t necessarily mean they’re

forming an active complex. So the next step

then can be here. Only when I have

all of them, would I have activity of the complex. We’ll talk about how AND-like

connections are formed. And then they also

can incorporate OR. So what does that mean? So if I know that all

members of the gene family can do something, I might

want to explicitly represent that gene family as an element

of the graph– a variable. Is any member of

this family active? And so that would be

done this way, where if you have an OR-like

function here, then this factor would make this gene

active if any of the parents are active. So there, they

give a toy example, where they’re trying to figure

out if the P53 pathway is active, so MDM2 is

an inhibitor of P53. P53 can be an

activator of apoptosis. And so separately,

for MDM2 and for P53, they have the factor graphs

that show the relationship between copy number

variation and transcript level and protein

level and activity. And those relate to each other. And then those relate to

the apoptotic pathway. So what they want to do

then is take the data that they have, in

terms of these pathways, and they want to compute

the likelihood ratios. What’s the probability

of observing the data, given a hypothesis

that this pathway is active and all my other settings

of the parameters? And compare that

to the probability of the data, given that

that pathway is not active. So this is the kinds

of likelihood ratios we’ve been seeing now

in a couple of lectures. So now it gets into the details
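A minimal sketch of that likelihood-ratio comparison, with invented per-observation probabilities:

```python
# Sketch: a log-likelihood ratio comparing "pathway active" versus
# "pathway inactive" hypotheses for independent binary observations.
# The per-observation probabilities below are invented.
import math

def log_likelihood_ratio(observations, p_active, p_inactive):
    """observations: 0/1 data points; p_*: P(obs = 1 | hypothesis)."""
    llr = 0.0
    for obs in observations:
        pa = p_active if obs else 1 - p_active
        pi = p_inactive if obs else 1 - p_inactive
        llr += math.log(pa / pi)   # sum of per-observation log ratios
    return llr

# mostly-on observations favor the "active" hypothesis (llr > 0)
print(log_likelihood_ratio([1, 1, 1, 0], p_active=0.8, p_inactive=0.2))
```

A positive sum favors the active hypothesis, a negative sum the inactive one, and independent observations simply add up on the log scale.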

of how you actually do this. So there’s a lot of manual

steps involved here. So if I want to encode a

regulatory pathway as a factor graph, it’s currently done in a

manual way or semi-manual way. You convert what’s

in the databases into the structure

or factor graph. And you make a

series of decisions about exactly how

you want to do that. You can argue with the

particular decisions they made, but they’re reasonable ones. People could do

things differently. So they convert the regulatory

networks into graphs. And then they have to

define some of the functions on this graph. So they define the expected

state of a variable, based on the state

of its parents. And they take a majority

vote of the parents. So a parent that’s connected by

a positive edge, meaning it’s an activator, if the

parent is active, then it contributes a

plus 1 to the child. If it’s connected by

a repressive edge, then the parent being active

would make a vote of minus 1 for the child. And you take the majority

vote of all those votes. So that’s what this says. But the nice thing is that you

can also incorporate logic. So for example, when

we said, is any member of this pathway active? And you have a

family member node. So that can be done

with an OR function. And there, it’s

these same factors that will determine–

so some of these edges are going to get

labeled “maximum” or “minimum,” that

tell you what’s the expected value of the

child, based on the parent. So if it’s an OR, then if any

of the parents are active, then the child is active. And if it’s AND, you

need all of them. So you could have described
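These parent-vote and AND/OR factors can be sketched in a few lines of pure Python (the encoding of active = +1, inactive = −1 follows the description above; the function names are my own):

```python
def majority_vote(parent_states, edge_signs):
    """Expected child state: each parent votes its state times the edge
    sign (+1 activating, -1 repressive); the majority wins, 0 on a tie."""
    votes = sum(s * e for s, e in zip(parent_states, edge_signs))
    return 1 if votes > 0 else (-1 if votes < 0 else 0)

def or_factor(parent_states):
    """'Maximum' label: the child is active if ANY parent is active."""
    return max(parent_states)

def and_factor(parent_states):
    """'Minimum' label: the child is active only if ALL parents are."""
    return min(parent_states)
```

So two active activators outvote one active repressor, while a single active member of a gene family is enough to switch an OR-type family node on.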

all of these networks by Bayesian networks. But the advantage

of a factor graph is that you’re explicitly

able to include all these steps to describe this

regulation in an intuitive way. So you can go back

to your models and understand what

you’ve done, and change it in an obvious way. Now critically, we’re

not trying to learn the structure of the

graph from the data. We’re imposing the

structure of the graph. We still need to learn

a lot of variables, and that’s done using

expectation maximization, as we saw in the

Bayesian networks. And then, again,

it’s a factor graph, which primarily means that we

can factor the global function into all of these factor nodes. So the total probability

is normalized, but it’s the product

of these factors which have to do with just the

variables that are connected to that factor

node in the graph. And this notation

that you’ll see if you look through

this, this notation means the setting

of all the variables consistent with something. So let’s see that– here we go. So this here, this is the

setting of all the variables X, consistent with the data

that we have– so the data being the arrays, the

RNA-Seq, if you had it. And so we want to compute

the marginal probability of some particular variable

being at a particular setting, given the fully

specified factor graph. And we just take the product

of all of these marginals. Is that clear? Consistent with all

the settings where that variable is

set to x equals a. Questions? OK. And we can compute the

likelihood function in the same way. So then what actually happens

when you try to do this? So they give an example here in

this more recent paper, where it’s basically a toy example. But they’re modeling all

of these different states in the cells. So G are the number

of genomic copies, T, the level of transcripts. Those are connected by a factor

to what you actually measure. So there is some true

change in the number of copies in the cell. And then there’s what

appears in your array. There’s some true number of

copies of RNA in the cell. And then there’s what you

get out of your RNA-Seq. So that’s what these

factors represent– and then these are

regulatory terms. So how much transcript you get

depends on these two variables, the epigenetic state

of the promoter and the regulatory proteins

that interact with it. How much transcript

gets turned into protein depends on regulatory proteins. And those are determined by

upstream signaling events. And how much protein

becomes active, again, is determined by the

upstream signaling events. And then those can have effects

on downstream pathways as well. So then in this toy example,

they’re looking at MYC/MAX. They’re trying to figure out

whether it’s active or not. So we’ve got this pathway. PAK2 represses MYC/MAX. MYC/MAX activates these two

genes and represses this one. And so if these were

the data that we had coming from copy number

variation, DNA methylation, and RNA expression, then I’d

see that the following states of the downstream genes–

this one’s active. This one’s repressed. This one’s active. This one’s repressed. They infer that

MYC/MAX is active. Oh, but what about the fact

that this one should also be activated? That can be explained

away by the fact that there’s a difference in the

epigenetic state between ENO1 and the other two. And then the belief

propagation allows us to transfer that information

upward through the graph to figure out, OK, so now we’ve

decided that MYC/MAX is active. And that gives us information

about the state of the proteins upstream of it and the

activity then of PAK2, which is a repressor of MYC/MAX. Questions on the factor

graphs specifically or anything’s that

come up until now? So this has all been

reasoning on known pathways. One of the big promises of

these systematic approaches is the hope that we can

discover new pathways. Can we discover things we

don’t already know about? And for this, we’re going to

look at interactome graphs, so graphs that are

built primarily from high throughput

protein-protein interaction data, but could also

be built, as we’ll see, from other kinds of

large-scale connections. And we’re going to look

at what the underlying structure of these

networks could be. And so they could

arise from a graph where you put an edge

between two nodes if they’re co-expressed, if they

have high mutual information. That’s what we saw

in, say, ARACNE, which we talked

about a lecture ago. Or, if say, the two hybrids

and affinity capture mass spec indicated direct

physical interaction or say a high throughput

genetic screen indicated a genetic interaction. These are going to be

very, very large graphs. And we’re going to look at some

of the algorithmic problems that we have dealing

with huge graphs and how to compress the

information down so we get some piece of the network

that’s quite interpretable. And we’ll look at

various kinds of ways of analyzing these graphs

that are listed here. So one of the advantages of

dealing with data in the graph formulation is that we can

leverage the fact that computer science has dealt with large

graphs for quite a while now, often in the context

of telecommunications. Now big data, Facebook,

Google– they’re always dealing with things in

a graph formulation. So there are a lot of algorithms

that we can take advantage of. We’re going to look at

how to use quick distance calculations on graphs. And we’ll look at that

specifically in an example of how to find the kinase

target relationships. Then we’ll look at how

to cluster large graphs to find subgraphs

that either represents an interesting

topological feature of the inherent

structure of the graph or perhaps to represent

active pieces of the network. And then we’ll look at

other kinds of optimization techniques to help us find

the part of the network that’s most relevant to our particular

experimental setting. So let’s start with

ostensibly a simple problem. I know a lot about– I have a

lot of protein phosphorylation data. I’d like to figure

out what kinase was that phosphorylated

a particular protein. So let’s say I have

this protein that’s involved in cancer

signaling, Rad50. And I know it’s phosphorylated

these two sites. And I have the sequences

of those sites. So what kinds of tools do

we have at our disposal if I have a set of

sequences that I believe are phosphorylated,

that would help me try to figure out what

kinase did the phosphorylation? Any ideas? So if I know the specificity of

the kinases, what could I do? I could look for

a sequence match between the specificity

of the kinase and the sequence of

the protein, right? In the same way that

we can look for a match between the specificity

of a transcription factor and the region of the

genome to which it binds. So if I have a library

of specificity motifs for different kinases,

where every position here represents a piece of

the recognition element, and the height of the letters

represent the information content, I can scan those. And I can see what

family of kinases are most likely to be

responsible for phosphorylating these sites. But again, those are

families of kinases. There are many

individual members of each of those families. So how do I find

the specific member of that family that’s

most likely to carry out the regulation? So here, what happens

in this paper. It’s called NetworKIN.

And as they say, well, let’s use the graph properties. Let’s try to figure out which

proteins are physically linked relatively closely in the

network to the target. So in this case, they’ve

got Rad50 over here. And they’re trying to figure out

which kinase is regulating it. So here are two kinases that

have similar specificity. But this one’s

directly connected in the interaction

network, so it’s more likely to be responsible. And here’s the member

of the kinase family that seems to be consistent with the

sequence being phosphorylated over here. It’s not directly connected,

but it’s relatively close. And so that’s also a

highly probable member, compared to one that’s

more distantly related. So in general, if I’ve

got a set of kinases that are all of equally good

sequence matches to the target sequence, represented by these

dash lines, but one of them is physically linked as well,

perhaps directly and perhaps indirectly, I have

higher confidence in this kinase because

of its physical links than I do in these. So that’s fine if you want

to look at things one by one. But if you want to look

at this at a global scale, we need very

efficient algorithms for figuring out what the

distance is in this interaction network between any

kinase and any target. So how do you go about

efficiently computing distances? Well, that’s where converting

things into a graph structure is helpful. So when we talk

about graphs here, we mean sets of vertices and

the edges that connect them. The vertices, in our case,

are going to be proteins. The edges are going

to perhaps represent physical interactions or some

of these other kinds of graphs we talked about. These graphs can be directed,

or they can be undirected. Undirected would be what? For example, say, two hybrid. I don’t know which one’s

doing what to which. I just know that two

proteins can come together. Whereas a directed edge

might be this kinase phosphorylates this target. And so it’s a directed edge. I can have weights

associated with these edges. We’ll see in a

second how we can use that to encode our confidence

that the edge represents a true physical interaction. We can also talk about the

degree, the number of edges that come into a

node or leave a node. And for our point,

it’s rather important to talk about the path,

the set of vertices that can get me from one

node to another node, without ever retracing my steps. And we’re going to

talk about path length, so if my graph is

unweighted, that’s just the number of

edges along the path. But if my graph

has edge weights, it’s going to be the sum of the

edge weights along that path. Is that clear? And then we’re going to

use an adjacency matrix to represent the graphs. So I have two completely

equivalent formulations of the graph. One is the picture on

the left-hand side, and the other one is the matrix

on the right-hand side, where a 1 between any row

and column represents the presence of an edge. So the only edge connecting

node 1 goes to node 2. Whereas, node 2 is connected

both to node 1 and to node 3. Hopefully, that agrees. OK. Is that clear? And if I have a weighted

graph, then instead of putting zeros or

ones in the matrix, I’ll put the actual edge

weights in the matrix. So there are algorithms that

exist for efficiently finding shortest paths in large graphs. So we can very

rapidly, for example, compute the shortest path

between any two nodes, based solely on that

adjacency matrix. Now why are we going to

look at weighted graphs? Because that gives us the

way to encode our confidence in the underlying data. So because the total

distance in the network is the sum of the edge weights,

if I set my edge weights to be negative log

of a probability, then if I sum all

the edge weights, I’m taking the product of

all those probabilities. And so the shortest

path is going to be the most probable

path as well, because it’s going to be the minimum of

the sum of the negative log. So it’s going to be the maximum

of the joint probability. Is that clear? OK. Very good. So by encoding our network as a

weighted graph, where the edge weights are minus log

of the probability, then when I use these standard

algorithms for finding the shortest path

between any two nodes, I’m also getting the most

probable path between these two proteins. So where do these edge
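Here’s a minimal sketch of that encoding (the toy “interactome” and its confidence values are invented for illustration): Dijkstra’s algorithm run over −log(probability) edge weights returns the most probable path, because minimizing a sum of −log(p) terms maximizes the product of the p’s.

```python
import heapq
import math

def most_probable_path(edges, source, target):
    """Dijkstra over w = -log(p) edge weights: path weights add, so the
    shortest path is the one whose confidences multiply to the highest
    joint probability."""
    graph = {}
    for u, v, p in edges:
        w = -math.log(p)
        graph.setdefault(u, []).append((v, w))
        graph.setdefault(v, []).append((u, w))  # undirected interactions
    dist, prev = {source: 0.0}, {}
    queue = [(0.0, source)]
    while queue:
        d, u = heapq.heappop(queue)
        if u == target:
            break
        if d > dist.get(u, float("inf")):
            continue  # stale queue entry
        for v, w in graph[u]:
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v], prev[v] = nd, u
                heapq.heappush(queue, (nd, v))
    # walk the predecessor chain back to recover the path
    path = [target]
    while path[-1] != source:
        path.append(prev[path[-1]])
    path.reverse()
    return path, math.exp(-dist[target])

# hypothetical interactome: (protein, protein, edge confidence)
edges = [("KinaseX", "Adapter", 0.9), ("Adapter", "Rad50", 0.8),
         ("KinaseX", "Hub", 0.5), ("Hub", "Rad50", 0.5)]
path, prob = most_probable_path(edges, "KinaseX", "Rad50")
```

The route through the high-confidence adapter wins (joint probability 0.9 × 0.8 = 0.72) over the route through the low-confidence hub (0.25).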

weights come from? So if my network consists

say of affinity capture mass spec and two hybrid

interactions, how would I compute the edge

of weights for that network? We actually explicitly

talked about this just a lecture or two ago. So I have all this

affinity capture mass spec, two hybrid data. And I want to

assign a probability to every edge that tells me how

confident I am that it’s real. So we already saw that in

the context of this paper where we use Bayesian networks

and gold standards to compute the probability for every

single edge in the interactome. OK. So that works pretty well if you

can define the gold standards. It turns out that that has

not been the most popular way of dealing with mammalian data. It works pretty well

for yeast, but it’s not what’s used primarily

in mammalian data. So in mammalian data, the

databases are much larger. The number of gold

standards is smaller. People rely on more

ad hoc methods. One of the big advances,

technically, for the field was the development of a common

way for all these databases of protein-protein interactions

to report their data, to be able to interchange them. There’s something called

PSICQUIC and PSISCORE, that allow a client to

pull information from all the different

databases of protein-protein interactions. And because you can get all

the data in a common format where it’s traceable back to

the underlying experiment, then you can start

computing confidence scores based on these

properties, what we know about where the data came

from in a high throughput way. Different people have

different approaches to computing those scores. So there’s a common

format for that as well, which is

this PSISCORE where you can build your interaction

database from whichever one of these underlying

databases you want, filter it however you want. And then send your database to

one of these scoring servers. And they’ll send

you back the scores according to their algorithm. One that I kind of like this

is this Miscore algorithm. It digs down into

the underlying data of what kind of

experiments were done and how many

experiments were done. Again, they make all sorts

of arbitrary decisions in how they do that. But the arbitrary

decisions seem reasonable in the absence of

any other data. So their scores are based on

these three kinds of terms– how many publications

there are associated with any interaction, what

experimental method was used, and then also, if there’s an

annotation in the database saying that we know that this

is a genetic interaction, or we know that it’s a

physical interaction. And then they put weights

on all of these things. So people can argue

about what the best way of approaching this is. The fundamental point

is that we can now have a very, very

large database of known interactions as weighted. So by last count,

there are about 250,000 protein-protein interactions

for humans in these databases. So you have that

giant interactome. It’s got all these scores

associated with it. And now we can dive

into that and say, these data are largely

unbiased by our prior notions about what’s important. They’re built up from

high throughput data. So unlike the carefully

curated pathways that are what everybody’s

been studying for decades, there might be information

here about pathways no one knows about. Can we find those pathways

in different contexts? What can we learn from that? So one early thing

people can do is just try to find pieces

of the network that seem to be modular,

where there are more interactions among the

components of that module than they are to other

pieces of the network. And you can find those

modules in two different ways. One is just based on

the underlying network. And one is based on the

network, plus some external data you have. So one would be

to say, are there proteins that fundamentally

interact with each other under all possible settings? And then we would say, in

my particular patient sample or my disease or

my microorganism, which proteins seem

to be functioning in this particular condition? So one is the topological model. That’s just the network itself. And one is the functional model,

where I layer on the information that the dark nodes are active

in my particular condition. So an early use of

this kind of approach was to try to annotate

nodes– in even well-studied genomes, there’s a large fraction of genes whose function we don’t know. So what if I use the

structure of the network to infer that if some protein

is close to another protein in this interaction

network, it is likely to have similar function? And statistically,

that’s definitely true. So this graph shows, for things

for where we know the function, the semantic similarity

in the y-axis, the distance in the

network in the x-axis, things that are close to

each other in the network of interactions, are

also more likely to be similar in terms of function. So how do we go

about doing that? So let’s say we

have got this graph. We’ve got some unknown

node labeled u. And we’ve got two

known nodes in black. And we want to systematically

deduce for every example like this, every u, what

its annotation should be. So I could just look

at its neighbors, and depending on how I

set the window around it, do I look at the

immediate neighbors? Do I go two out? Do I go three out? I could get different answers. So if I set K equal to 1,

I’ve got the unknown node, but all the neighbors

are also unknown. If I go two steps out,

then I pick up two knowns. Now there’s a fundamental

assumption going on here that the node has the same

function as its neighbors, which is fine when the

neighbors are homogeneous. But what do you do when the

neighbors are heterogeneous? So in this case, I’ve

got two unknowns u and v. And if I just were to take

the K nearest neighbors, they would have the same

neighborhood, right? But I might have a prior

expectation that u is more like the black nodes, and v is

more like the grey nodes. So how do you choose

the best annotation? The K nearest neighbors is

OK, but it’s not optimal. So here’s one approach,

which says the following. I’m going to go through

for every function, every annotation in my

database, separately. And for each annotation,

I’ll set all the nodes that have that annotation to

plus 1 and every node that doesn’t have that

annotation, either it’s unknown or it’s got some other

annotation, to minus 1. And then for every

unknown, I’m going to try to find the

setting which is going to maximize the sum of products. So we’re going to take the

sum of the products of u and all of its neighbors. So in this setting,
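The local rule for one unknown node can be sketched directly (labels are +1/−1; the node and label names here are made up):

```python
def best_label(node, neighbors, labels):
    """Choose the +1/-1 setting of `node` that maximizes the sum of
    products with its neighbors' current labels. The node's own label
    factors out of the sum, so this is just the sign of the neighbor
    total."""
    score = sum(labels[v] for v in neighbors[node])
    return 1 if score >= 0 else -1

# the lecture's toy case: two +1 neighbors and one -1 neighbor
neighbors = {"u": ["a", "b", "c"]}
labels = {"a": 1, "b": 1, "c": -1}
label = best_label("u", neighbors, labels)
```

With two positive neighbors and one negative, setting u to +1 scores +1, while −1 scores −1, so the rule picks +1.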

if I set u to plus 1, then I do better than if I

set it to minus 1, right? Because I’ll get plus

1 plus 1 minus 1. So that will be better

than setting it to minus 1. Yes. AUDIENCE: Are we ignoring

all the edge weights? PROFESSOR: In this case, we’re ignoring the edge weights. We’ll come back to using the edge weights later. But this was done with

an unweighted graph. AUDIENCE: [INAUDIBLE]

[? nearest neighborhood ?] they’re using it then? PROFESSOR: So here they’re

using the nearest neighbors. That’s right, with

no cutoff, right? So any interaction. So then we could iterate

this into convergence. That’s one problem with this. But maybe a more

fundamental problem is that you’re never going to

get the best overall solution by this local

optimization procedure. So consider a setting like this. Remember, I’m trying

to maximize the sum of the product of the

settings for neighbors. So how could I ever– it seems

plausible that all A, B, and C here, should have the

red annotation, right? But if I set C to red,

that doesn’t help me. If I set A to red,

that doesn’t help me. If I set B to red, it

makes things worse. So no local change is going

to get me where I want to go. So let’s think for a second. What algorithms

have we already seen that could help us get

to the right answer? We can’t get here by

local optimization. We need to find the global

minimum, not the local minimum. So what algorithms

have we seen that help us find that

global minimum? Yeah– sorry– you said

simulated annealing. So the simulated annealing

version in this setting is as follows. I initialize the graph. I pick a neighboring node,

v, that I’m going to add. Say we’ll turn one of these red. I check the value of that sum of

the products for this new one. And if it’s improving

things, I keep it. But the critical thing

is, if it doesn’t improve, if it makes things

worse, I still keep it with some probability. It’s based on how bad

things have gotten. And by doing this,
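That acceptance step is the Metropolis rule; a minimal sketch (the injectable random source is my addition, to make the rule testable, and the temperature schedule is left to the caller):

```python
import math
import random

def accept_move(delta, temperature, rng=random.random):
    """Always accept improvements (delta >= 0); accept a worsening move
    (delta < 0) with probability exp(delta / temperature), so the worse
    the move, or the colder the system, the less likely we keep it."""
    if delta >= 0:
        return True
    return rng() < math.exp(delta / temperature)
```

Lowering the temperature over the iterations makes uphill moves increasingly rare, so the search can escape local optima early and settle later.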

we can climb the hill and get over to

some global optimum. So we saw simulated annealing before. In what context? In the side chain

placement problem. Here we’re seeing it again. It’s quite broad. Any time you’ve got a

local optimization that doesn’t get you

where you need to go, you need global optimization. You can think

simulated annealing. It’s quite often a plausible way to go. All right. So this is one approach

for annotation. We also wanted to see

whether we could discover inherent structure

in these graphs. So often, we’ll be

interested in trying to find clusters in a graph. Some graphs have obvious

structures in them. Other graphs, it’s a

little less obvious. What algorithms exist

for trying to do this? We’re going to look at two

relatively straightforward ways. One is called edge

betweenness clustering and the other one

is a Markov process. Edge betweenness, I think,

is the most intuitive. So I look at each

edge, and I ask for all pairs of

nodes in the graph, does the shortest path

between those nodes pass through this edge? So if I look at this edge,

very few shortest paths go through this edge, right? Just the shortest path

for those two nodes. But if I look at this edge,

all of the shortest paths between any node on this side

and any node on this side have to pass through there. So that has a high betweenness. So if I want a cluster, I

can go through my graph. I can compute betweenness. I take the edge that has

the highest betweenness, and I remove it from my graph. And then I repeat. And I’ll be slowly

breaking my graph down into chunks that are relatively

more connected internally than they are to

things in other pieces. Any questions? So that’s an entire
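One removal round of that procedure can be sketched in pure Python (for simplicity this counts a single BFS shortest path per node pair rather than splitting credit across ties; the two-triangle test graph is made up):

```python
from collections import deque
from itertools import combinations

def bfs_path(graph, s, t):
    """One shortest path from s to t in an unweighted graph (BFS)."""
    prev = {s: None}
    queue = deque([s])
    while queue:
        u = queue.popleft()
        if u == t:
            break
        for v in graph[u]:
            if v not in prev:
                prev[v] = u
                queue.append(v)
    path = [t]
    while prev[path[-1]] is not None:
        path.append(prev[path[-1]])
    return path[::-1]

def most_between_edge(graph):
    """For every node pair, count the edges its shortest path uses
    (one path per pair), and return the busiest edge -- the one to
    remove first when clustering."""
    counts = {}
    for s, t in combinations(graph, 2):
        p = bfs_path(graph, s, t)
        for u, v in zip(p, p[1:]):
            edge = tuple(sorted((u, v)))
            counts[edge] = counts.get(edge, 0) + 1
    return max(counts, key=counts.get)

# two triangles joined by the single bridge edge C-D
graph = {"A": {"B", "C"}, "B": {"A", "C"}, "C": {"A", "B", "D"},
         "D": {"C", "E", "F"}, "E": {"D", "F"}, "F": {"D", "E"}}
```

Every cross-triangle pair must route through the bridge, so C–D accumulates the highest betweenness and is removed first, splitting the graph into its two natural modules.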

edge betweenness clustering algorithm. Pretty straightforward. Now an alternative is a

Markov clustering method. And the Markov

clustering method is based on the idea of

random walks in the graph. So again, let’s try to

develop some intuition here. If I start at some

node over here, and I randomly wander

across this graph, I’m more likely to stay

on the left-hand side than I am to move all the way

across to the right-hand side, correct? So can I formalize that

and actually come up with a measure of how

often any node will visit any other and then use

that to cluster the graph? So remember our

adjacency matrix, which just represented which

nodes were connected to which. And what happens if I multiply

the adjacency matrix by itself? So I raise it to some power. Well, if I multiply the

adjacency matrix by itself just once, the squared adjacency

matrix has the property that it tells me how

many paths of length 2 exist between any two nodes. So the adjacency matrix told

me how many paths of length 1 exist. Right? You’re directly connected. If I squared the

adjacency matrix, it tells me how many

paths of length 2 exist. N-th power tells me how many

paths of length N exist. So let’s see if that works. This claims that

there are exactly two paths that connect

node 2 to node 2. What are those two paths? Connect node 2 to node 2. I go here, and I go back. That’s the path of length 2, and

this is the path of length 2. And there are zero

paths of length 2 that connect node 2 to

node 3, because 1, 2. I’m not back at 3. So in general, an entry of A to the N equals m if there exist exactly m paths

of length N between those two nodes. So how does this help me? Well, you take that idea of

the N-th power of the adjacency matrix and convert it to a

transition probability matrix, simply by normalizing. So if I were to do a

random walk in this graph, what’s the probability

that I’ll move from node i to node j in a certain

number of steps? That’s what I want to compute. So I need to have a

stochastic matrix, where the sum of

the probabilities for any transition is 1. I have to end up somewhere. I either end up back

in myself, or I end up at some other nodes. I’m just going to take

that adjacency matrix and normalize the columns. And then that gives me

the stochastic matrix. And then I can exponentiate

the stochastic matrix to figure out my probability

of moving from any node to any other in a

certain number of steps. Any questions on that? OK. So if we simply keep multiplying
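Both pieces so far– counting paths with matrix powers and column-normalizing to a stochastic matrix– fit in a few lines (pure Python, using the lecture’s 3-node chain where node 2 links to nodes 1 and 3):

```python
# Chain graph 1-2-3 from the lecture: node 2 is connected to 1 and 3.
A = [[0, 1, 0],
     [1, 0, 1],
     [0, 1, 0]]

def matmul(X, Y):
    n = len(X)
    return [[sum(X[i][k] * Y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

# A^2[i][j] counts the paths of length 2 between nodes i and j.
A2 = matmul(A, A)

# Column-normalize A into the random-walk (stochastic) matrix: each
# column sums to 1, so every step has to go somewhere.
n = len(A)
col_sums = [sum(A[i][j] for i in range(n)) for j in range(n)]
T = [[A[i][j] / col_sums[j] for j in range(n)] for i in range(n)]
```

As in the lecture’s example, A² reports two length-2 paths from node 2 back to itself and none from node 2 to node 3, and raising T to a power instead gives the probability of those multi-step moves.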

this stochastic matrix, we’ll get the probability of

increasing numbers of moves. But it doesn’t give us sharp

partitions of the matrix. So to do a Markov clustering,

we do an exponentiation of this matrix with

what’s called an inflation operator, which

is the following. This inflation operator

takes the r-th power of each entry of the matrix and divides by the sum of the r-th powers of the entries in that column. So here’s an example. Let’s say I’ve got two

probabilities– 0.9 and 0.1. When I inflate it, I

square the numerator, and I square each element

of the denominator. Now I’ve gone from 0.9

to roughly 0.99 and 0.1 to roughly 0.01. So this inflation
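Inflating one column of the matrix looks like this in code (squaring and renormalizing 0.9/0.1 actually gives about 0.988/0.012, which the lecture rounds to 0.99/0.01):

```python
def inflate(column, r=2):
    """Raise each column entry to the r-th power, then renormalize so
    the column sums to 1 again: big probabilities get bigger, small
    ones get smaller."""
    powered = [p ** r for p in column]
    total = sum(powered)
    return [p / total for p in powered]

col = inflate([0.9, 0.1])   # -> [0.81/0.82, 0.01/0.82]
```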

operator exaggerates all my probabilities and makes

the higher probabilities more probable and makes the lower

probabilities even less probable. So I take this

transition matrix that represents the random walk on my graph, and I exaggerate it with

the inflation operator. And that takes the

basic clustering, and it makes it more compact. So the algorithm for this

Markov clustering is as follows. I start with a graph. I add loops to the graph. Why do I add loops? Because I need some probability

that I stay in the same place, right? And in a normal

adjacency matrix, you can’t stay in

the same place. You have to go somewhere. So I add a loop. So there’s always a self loop. Then I set the inflation

parameter to some value. M_1 is the matrix of random

walks in the original graph. I multiply that. I inflate it. And then I find the difference. And I do that until the

difference between successive matrices gets below some value. And what I end up with then
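The whole loop can be sketched compactly (a toy pure-Python implementation; the barbell test graph, the fixed iteration count in place of a convergence check, and the attractor-based cluster read-out are my simplifications):

```python
def column_normalize(M):
    """Make each column sum to 1 (a column-stochastic matrix)."""
    n = len(M)
    sums = [sum(M[i][j] for i in range(n)) for j in range(n)]
    return [[M[i][j] / sums[j] for j in range(n)] for i in range(n)]

def mcl(A, r=2, steps=50):
    """Markov clustering sketch: add self-loops, normalize, then
    alternate expansion (M @ M) with inflation (entrywise r-th power
    plus renormalization) until the flow settles."""
    n = len(A)
    M = column_normalize([[A[i][j] + (i == j) for j in range(n)]
                          for i in range(n)])
    for _ in range(steps):
        M = [[sum(M[i][k] * M[k][j] for k in range(n))   # expansion
              for j in range(n)] for i in range(n)]
        M = column_normalize([[M[i][j] ** r for j in range(n)]  # inflation
                              for i in range(n)])
    # group each node with the attractor its flow ends up on
    clusters = {}
    for j in range(n):
        attractor = max(range(n), key=lambda i: M[i][j])
        clusters.setdefault(attractor, set()).add(j)
    return list(clusters.values())

# two triangles joined by a single bridge edge between nodes 2 and 3
A = [[0, 1, 1, 0, 0, 0],
     [1, 0, 1, 0, 0, 0],
     [1, 1, 0, 1, 0, 0],
     [0, 0, 1, 0, 1, 1],
     [0, 0, 0, 1, 0, 1],
     [0, 0, 0, 1, 1, 0]]
clusters = mcl(A)
```

On this barbell graph the flow across the lone bridge decays under inflation, and the procedure recovers the two triangles as separate clusters.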

are relatively sharp partitions of the overall structure. So I’ll show you an

example of how that works. So in this case,

the authors were using a matrix where the

nodes represented proteins. The edges represented

BLAST hits. And what they wanted

to do was find families of proteins

that had similar sequence similarity to each other. But they didn’t want it to be

entirely dominated by domains. So they figured that this graph

structure would be helpful, because you’d get–

for any protein, there’d be edges,

not just things that had similar common

domains, but also things that had edges connecting it

to other proteins as well. So in the original graph, the

edges are these BLAST values. They come up with the

transition matrix. They convert into

the Markov matrix, and they carry out

that exponentiation. And what they end

up with are clusters where any individual domain

can appear in multiple clusters. The clusters are shaped not

just by the highest BLAST hit, but by the whole network

property of what other proteins they’re connected to. And it’s also been done with a

network, where the underlying network represents

gene expression, and edges between two

genes represent the degree of correlation of the expression

across a very large data set for 61 mouse tissues. And once again, you

take the overall graph, and you can break it

down into clusters, where you can find

functional annotations for specific clusters. Any questions then on

the Markov clustering? So these are two

separate ways of looking at the underlying

structure of a graph. We had the edge betweenness

clustering and the Markov clustering. Now when you do this, you

have to make some decisions. Say I’ve found this cluster– now how do I decide

what it’s doing? So you need to do some

sort of annotation. So once I have a

cluster, how am I going to assign a

function to that cluster? So one thing I could

do would be to look at things that already

have an annotation. So I got some cluster. Maybe two members

of this cluster have an annotation and

two members of this one. And that’s fine. But what do I do

when a cluster has a whole bunch of

different annotations? So I could be arbitrary. I could just take the one

that’s the most common. But a nice way to do it is by

the hypergeometric distribution that you saw in the earlier

part of the semester. So these are all ways of
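That enrichment test can be written directly as a hypergeometric tail probability (stdlib only; the cluster and annotation counts below are hypothetical):

```python
from math import comb

def hypergeom_tail(k, K, n, N):
    """P(X >= k): the chance of seeing at least k annotated members in
    a cluster of n proteins drawn from N total, of which K carry the
    annotation."""
    upper = min(K, n)
    total = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return total / comb(N, n)

# hypothetical numbers: 6 of a 10-protein cluster carry a 'DNA repair'
# label, out of 50 such labels in a 1000-protein network
p = hypergeom_tail(6, 50, 10, 1000)
```

A tiny p says the cluster is far more enriched for the annotation than a random draw of the same size, so that’s the label we’d assign to it.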

clustering the underlying graph without any reference

to specific data for a particular condition

that you’re interested in. A slightly harder

problem is when I do have those

specific data, and I’d like to find a piece

of the network that’s most relevant to

those specific data. So it could be different

in different settings. Maybe the part of

the network that’s relevant in the

cancer setting is not the part of the network that’s

relevant in the diabetes setting. So one way to think about this

is that I have the network, and I paint onto

it my expression data or my proteomic data. And then I want to find

chunks of the network that are enriched in activity. So this is sometimes called

the active subgraph problem. And how do we find

the active subgraph? Well, it’s not that

different from the problem that we just looked at. So if I want to figure out a

piece of the network that’s active, I could just take the

things that are immediately connected to each other. That doesn’t give me

the global picture. So instead why

don’t I try to find larger chunks of

the network where I can include some

nodes for which I do not have specific data? And one way that’s

been done for that is, again, the simulated

annealing approach. So you can try to find

pieces of the network that maximize the

probability that all the things in the

subnetwork are active. Another formulation

of this problem is something that’s called

the Steiner tree problem. And in the Steiner tree,

I want to find trees in the network that consist of

all the nodes that are active, plus some nodes that are not,

for which I have no data. And those nodes for

which I have no data are called Steiner nodes. And this was a problem that

was looked at extensively in telecommunications. So if I want to wire up

a bunch of buildings– back when people used wires–

say to give telephone service, I need to figure out

what the minimum cost is for wiring them all up. And sometimes, that involves

sticking a pole in the ground, then having everybody

communicate to that pole. So if I’ve got paying

customers over here, and I want to wire

them to each other, I could run wires

between everybody. But I don’t have to. If I stick a pole over here,

then I don’t need this wire, and I don’t need this wire,

and I don’t need this wire. So this is what’s

called a Steiner node. And so in graph theory, there

are pretty efficient algorithms for finding a Steiner graph–

the Steiner tree– the smallest tree that connects

all of the nodes. Now the problem in our setting
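The standard 2-approximation for that problem (take shortest paths between terminals to form the metric closure, build its minimum spanning tree, then expand the paths) can be sketched with the standard library; the wiring example below is invented, with three "buildings" and one candidate "pole":

```python
import heapq

def dijkstra(adj, src):
    """Shortest-path distances and predecessors from src.
    adj maps node -> {neighbor: edge weight}; graph assumed connected."""
    dist, prev = {src: 0.0}, {}
    pq = [(0.0, src)]
    while pq:
        d, u = heapq.heappop(pq)
        if d > dist.get(u, float("inf")):
            continue
        for v, w in adj[u].items():
            if d + w < dist.get(v, float("inf")):
                dist[v], prev[v] = d + w, u
                heapq.heappush(pq, (d + w, v))
    return dist, prev

def steiner_tree_approx(adj, terminals):
    """2-approximate Steiner tree: MST over the terminals' metric
    closure, with closure edges expanded back into shortest paths."""
    closure = {t: dijkstra(adj, t) for t in terminals}
    in_tree, edges = {terminals[0]}, set()
    while len(in_tree) < len(terminals):       # Prim's MST on terminals
        u, v = min(((a, b) for a in in_tree
                    for b in terminals if b not in in_tree),
                   key=lambda e: closure[e[0]][0][e[1]])
        in_tree.add(v)
        node, prev = v, closure[u][1]
        while node != u:                        # expand closure edge u..v
            edges.add(frozenset((prev[node], node)))
            node = prev[node]
    return edges

# Three buildings a, b, c; direct wires cost 2.5 each, but a central
# pole h reaches each building for 1.0 -- the pole is the Steiner node.
adj = {"a": {"h": 1.0, "b": 2.5, "c": 2.5},
       "b": {"h": 1.0, "a": 2.5, "c": 2.5},
       "c": {"h": 1.0, "a": 2.5, "b": 2.5},
       "h": {"a": 1.0, "b": 1.0, "c": 1.0}}
tree = steiner_tree_approx(adj, ["a", "b", "c"])
```

NetworkX ships this same approximation as `networkx.algorithms.approximation.steiner_tree`, which is what you would reach for in practice.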

is that we don’t necessarily want to connect every

node, because we’re going to have in our

data some things that are false positives. And if we connect too

many things in our graph, we end up with what are

lovingly called “hairballs.” So I’ll give you a

specific example of that. Here’s some data that

we were working with. We had a relatively small

number of experimental hits that were detected as

changing in a cancer setting, and the

interactome graph. And if you simply look

for the shortest path, I should say, between

the experimental hits across the

interactome, you end up with something that looks very

similar to the interactome. So you start off with a

relatively small set of nodes, and you try to find

the subnetwork that includes everything. And you get a giant graph. And it’s very hard

to figure out what to do with a graph

that’s this big. I mean, there may be

some information here, but you’ve taken a

relatively simple problem to try to understand the

relationship among these hits. And you’ve turned it

into a problem that now involves hundreds

and hundreds of nodes. So these kinds of

problems arise, as I said, in part, because of

noise in the data. So some of these

hits are not real. And incorporating

those, obviously, makes me take very long

paths in the interactome. But the problem also arises because of

the noise in the interactome– both false positives

and false negatives. So I have two proteins

that I’m trying to connect, and there’s a false

positive in the interactome. It’s going to draw

a line between them. If there’s a false negative

in the interactome, maybe these things really do

interact, but there’s no edge. If I force the algorithm

to find a connection, it probably can, because

most of the interactome is one giant

connected component. But it could be a

very, very long path that goes through

many other proteins. And so in the process of

trying to connect all my data, I can get extremely

large graphs. So to avoid having

giant networks– so on this projector,

unfortunately, you can’t see this very well. But there are a lot of edges

among all the nodes here. Most of you have your computers. You can look at it there. So in a Steiner tree

approach, if my data are the ones that are yellow,

they’re called terminals. And the grey ones,

I have no data. And if I ask it to solve

the Steiner tree problem, it’s going to have to find a

way to connect this node up to the rest of the network. But if this one’s

a false positive, that’s not the desired outcome. So there are

optimization techniques that actually allow me to

tell the algorithm that it’s OK to leave out some of the data

to get a more compact network. So one of those approaches

is called a prize collecting Steiner tree problem. And the idea here

is the following. For every node for which

I have experimental data, I associate with

that node a prize. The prize is larger,

the more confident I am that that node is

relevant in the experiment. And for every edge,

I take the edge away, and I convert it into a cost. If I have a high confidence

edge, there’s a low cost. It’s cheap. Low confidence edges are

going to be very expensive. And now I ask the

algorithm to try to connect up all

the things it can. Every time it includes a

node for which there's data, it keeps the prize, but it had to add

an edge, so it pays the cost. So there’s a trade-off

for every node. So if the algorithm wants

to include this node, then it’s going to pay the

price for all the edges, but it gets to keep the node. So the optimization

function is the following. For every vertex that’s not in

the tree, there’s a penalty. And for every edge in

the tree, there’s a cost. And you want to minimize

the sum of these two terms. You want to minimize the number

of edge costs you pay for. And you want to

minimize the number of prizes you leave behind. Is that clear? So then the algorithm then can,

depending on the optimization terms, figure out is it more of

a benefit to include this node, keep the prize, and pay all

the edge costs or the opposite? Throw it out. You don’t get to keep

the prize, but you don’t have to pay

the edge costs. And so that turns these

very, very large networks into relatively compact ones. Now solving this problem is

actually rather computationally challenging. You can do it with integer

linear programming. It takes a huge

amount of memory. There’s also signal and

message passing approach. If you’re interested in

the underlying algorithms, you can look at some

of these papers. So what happens when

you actually do this? So that hairball that

I showed you before consisted of a very

small initial data set. If you do a shortest path

search across the network, you get thousands

of edges shown here. But the prize collecting Steiner

tree solution to this problem is actually extremely compact,

and it consists of subnetworks. You can cluster

it automatically. This was clustered by hand,

but you get more or less the same results. It’s just not quite as pretty. If you cluster by hand or by

say, edge betweenness, then you get subnetworks

that are enriched in various reasonable

cellular processes. This was a network

built from cancer data. And you can see things that are

highly relevant to cancer– DNA damage, cell cycle, and so on. And the really nice

thing about this then is it gives you

a very focused way to then go and do experiments. So you can take the networks

that come out of it. And now you’re not

operating on a network that consists of tens of

thousands of edges. You’re working on a

network that consists of very small sets of proteins. So in this particular

case, we actually were able to go in and test

a number of the nodes that were not detected by

the experimental data, but were inferred by the

algorithm: the Steiner nodes, which had no

direct experimental data. We were able to test whether blocking

the activities of these nodes had any effect on the

growth of these tumor cells. We were able to show that

nodes that were very central to the

network that were included in the prize collecting

Steiner tree solution had a high probability

of being cancer targets. Whereas the ones that were

just slightly more removed were much lower in probability. So one of the advantages of

these large interaction graphs is they give us a

natural way to integrate many different kinds of data. So we already saw that the

protein levels and the mRNA levels agreed very

poorly with each other. And we talked about

the fact that one thing you could do with

those data would be to try to find the

connections between not the RNAs and the proteins,

but the connections between the RNAs

and the things that drove the expression of the RNA. And so as I said, we’ll see

in one of Professor Gifford’s lectures, precisely

how to do that. But once you are able to do

that, you take epigenetic data, look at the regions that are

regulatory around the sites of genes that are

changing in transcription. You can infer DNA

binding proteins. And then you can

pile all those data onto an interaction

graph, where you’ve got different kinds of edges. So you’ve got RNA nodes that

represent the transcript levels. You’ve got the

transcription factors that you infer from the

epigenetic data. And then you’ve got the

protein-protein interaction data that came from

the two hybrid, the affinity capture mass spec. And now you can put all

those different kinds of data in the same graph. And even though

there’s no correlation between what happens in an RNA

and what happens in the protein level– or very

low correlation– there’s this

physical process that links that RNA up

to the signaling pathways that are above it. And by using the prize

collecting Steiner tree approaches, you can rediscover those connections. And these kinds

of networks can be very valuable for other kinds

of data that don’t agree. So it’s not unique to transcript

data and proteome data. Turns out there are many

different kinds of omic data that, when looked at individually,

give you very different views of what’s going on in a cell. So if you take knockout data,

so which genes, when knocked out, affect the phenotype? And which genes, in

the same condition, change in expression? Those give you two

completely different answers about which genes are important

in a particular setting. So here we’re looking at

which genes are differentially expressed when you put

cells under a whole bunch of these different conditions. And which genes

when knocked out, affect viability

in that condition. And then the right-hand

column shows the overlap in the number of genes. And you can see the

overlap is small. In fact, it’s less

than you would expect by chance

for most of these. So just to drill that home, if

I do two separate experiments on exactly the same

experimental system, say yeast responding

to DNA damage. And in one case,

I read out which genes are important by

looking at RNA levels. And the other one,

I read out which genes are important by knocking

every gene out and seeing whether it affects viability. We’ll get two completely

different sets of genes. And we’ll also have two

completely different sets of gene ontology categories. But there is some underlying

biological process that gives rise to that, right? And one of the

reasons for this is different assays are

measuring different things. So it turns out, if you

look– at least in yeast– over 156 different

experiments, for which there’s both transcriptional

data and genetic data, the things that come

out in genetic screens seem to be master regulators. Things that were knocked out

have a big effect on phenotype. Whereas the things that

change in expression tend to be effector molecules. And so in say, the

DNA damage case, the proteins that

were knocked out and have a big

effect on phenotype are ones that detect DNA damage

and signal to the nucleus that there has been

DNA damage, which then goes on to

block the cell cycle and initiate the DNA

damage repair response. Those things show

up as genetic hits, but they don’t show up as

differentially expressed. The things that do show up

as differentially expressed are the repair enzymes. Those, when you

knock them out, don’t have a big effect on

phenotype, because they’re highly redundant. But there are these

underlying pathways. And so the idea is well, you

could reconstruct these by, again, using the

epigenetic data, the kind of analysis

Professor Gifford will talk about in

upcoming lectures. And infer the transcription

factors, and then use the network properties to try to build

up a full network of how those relate to upstream

signaling pathways that would then include

some of the genetic hits. I think I’ll skip to

the punchline here. So we’ve looked at a number of

different modeling approaches for these large interactomes. We’ve also looked at

ways of identifying transcriptional

regulatory networks using mutual information,

regression, Bayesian networks. And how do all these

things fit together? And when would you want to

use one of these techniques, and when would you

want to use another? So I like to think about the

problem along these two axes. On one dimension,

we’re thinking about whether we have systems of

known components or unknown components. And the other one

is whether we want to identify physical

relationships or statistical relationships. So clustering, regression,

mutual information– those are very, very

powerful for looking at the entire genome,

the entire proteome. What they give you are

statistical relationships. There’s no guarantee of

a functional link, right? We saw that in the prediction

that postprandial laughter predicts breast cancer

outcome, that there’s no causal link between those. Ultimately, you can

find some reason why it’s not totally random. But it’s not as if

that’s going to lead you to new drug targets. But those can be on a

completely hypothesis-free way, with no external data. Bayesian networks are

somewhat more causal. But depending on how

much data you have, they may not be

perfectly causal. You need a lot of

intervention data. We also saw that they did

not perform particularly well in discovering

gene regulatory networks in the DREAM challenge. These interactome

models that we’ve just been talking about work very

well across giant omic data sets. And they require

this external data. They need the interactome. So it works well in

organisms for which you have all that

interactome data. It’s not going to work in an

organism for which you don’t. What they give you

at the end, though, is a graph that tells

you relationships among the proteins. But it doesn’t tell

you what’s going to happen if you start to

perturb those networks. So if I give you the

active subgraph that has all the proteins and genes

that are changing expression in my tumor sample, now

the question is, OK, should you inhibit the

nodes in that graph? Or should you activate

the nodes in that graph? And the interactome

model doesn’t tell you the answer to that. And so what you’re going to

hear about in the next lecture from Professor

Lauffenburger are models that live up in this space. Once you’ve defined a relatively

small piece of the network, you can use other kinds

of approaches– logic based models, differential

equation based models, decision trees, and other techniques

that will actually make very quantitative

predictions. What happens if I inhibit

a particular node? Does it activate the process,

or does it repress the process? And so what you could

think about then is going from a completely

unbiased view of what’s going in a cell, collect all

the various kinds of omic data, and go through these

kinds of modeling approaches to identify a

subnetwork that’s of interest. And then use the techniques

that we’ll [? be hearing ?] about in the next lecture

to figure out quantitatively what would happen if I were

to inhibit individual nodes or inhibit combinations of

nodes or activate, and so on. Any questions on anything

we’ve talked about so far? Yes. AUDIENCE: Can you say again

the fundamental difference between why you get those two

different results if you’re just weeding out the gene

expression versus the proteins? PROFESSOR: Oh, sure. Right. So we talked about the fact that

if you look at genetic hits, and you look at

differential expression, you get two completely different

views of what’s going in cells. So why is that? So the genetic hits to tend to

hit master regulators, things that when you knock

out a single gene, you have a global

effect on the response. So in the case of

DNA damage, those are things that

detect the DNA damage. Those genes tend often not

to be changing very much in expression. So transcription factors

are very low abundance. They usually don’t

change very much. A lot of signaling proteins

are kept at a constant level, and they’re regulated

post-transcriptionally. So those don’t show up in

the differential expression. The things that are

changing in expression– say the response regulators,

the DNA damage response– those often are redundant. So one good analogy is to

think about a smoke detector. A smoke detector

is on all the time. You don’t wait until the fire. So that’s not going to be

changing in expression, if you will. But if you knock it out,

you’ve got a big problem. The effectors, say

the sprinklers– the sprinklers only come

on when there’s a fire. So that’s like the

response genes. They come on only in

certain circumstances, but they’re highly redundant. Any room will have

multiple sprinklers, so if one gets

damaged or is blocked, you still get a response. So that’s why you get this

discrepancy between the two different kinds of data. But again, in both

cases, there’s an underlying physical process

that gives rise to both. And if you do this

properly, you can detect that on these

interactome models. Other questions? OK. Very good.
