Return-Path: XPUM04@prime-a.central-services.umist.ac.uk
Received: from G.SEI.CMU.EDU by ubu.cert.sei.cmu.edu (5.61/2.3)
        id AA14019; Fri, 1 Jun 90 12:00:12 -0400
Received: from SEI.CMU.EDU by g.sei.cmu.edu (5.61/2.5)
        id AA20825; Fri, 1 Jun 90 12:00:09 -0400
Received: from nsfnet-relay.ac.uk by sei.cmu.edu (5.61/2.3)
        id AA02622; Fri, 1 Jun 90 11:59:56 -0400
Received: from sun.nsfnet-relay.ac.uk by vax.NSFnet-Relay.AC.UK 
           via Janet with NIFTP  id aa24004; 1 Jun 90 16:12 BST
From: Anthony Appleyard <XPUM04@prime-a.central-services.umist.ac.uk>
To: KRVW <@NSFnet-Relay.AC.UK:KRVW@sei.cmu.edu>
Date:         Fri, 01 Jun 90 16:07:48 BST 
Message-Id:   <$TGTWCZCFFBRV at UMPA>
Subject:      Virus-L vol 0 issue #0516



Virus-L Digest Mon, 16 May 88, Volume 0 : Issue #0516

Today's Topics

Various protection schemes against executable-infecting viruses
Re: software self-checks
RE: XMAS EXEC add'l comments
detecting damaged data
Re:  Signature lengths
need RISKS LOG
Re:  The neverending chain....

------------------------------

Date:         Mon, 16 May 88 04:42:20 EDT
Reply-To:     Virus Discussion List <VIRUS-L@LEHIIBM1>
Sender:       Virus Discussion List <VIRUS-L@LEHIIBM1>
From:         Amanda B Rosen <abr1@cunixc.columbia.edu>
Subject:      Various protection schemes against executable-infecting viruses

Hi... first posting for me...

About the one-to-one mapping, I believe that it has already been mentioned
that data compressors generate unique "signatures". In fact, for any desired
amount of certainty, there will be a spectrum of algorithms you can use,
ranging from storage-intensive, computationally undemanding ones to ones
which make great demands on the CPU but generate much smaller signatures.
For example, given a desire for 100% certainty, you could copy the file (for
a disk-intensive scheme) or Huffman-encode it (CPU-bound). Likewise, for
less certainty you could keep a large checksum or a small CRC.

Without giving the matter much thought (yet), I wonder if Huffman encoding
might not be a useful tool in the Virus wars. Of course, it makes large
demands on the processor, but you don't need to huf the whole file. I am
guessing now, but I bet that a 'database' type analysis which huf'ed a
file and kept just the table could defeat every file-modifying virus out
there today. Unfortunately this takes a lot of time. The $64K question is,
are there any Huffman-type schemes which take less time, but have the same
basic character?

The reason I am intrigued by Huffman is that it analyzes the character of
the data in the file. An example of another, very simple analysis scheme is
one that counts the number of occurrences of each byte value in the file. This
would be difficult to defeat, because the virus would probably have to alter
such a large part of the infected program to conform to the validation check
that the program wouldn't be able to execute at all.
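
The byte-count check described above can be sketched in a few lines of modern
Python (the function names here are invented for illustration):

```python
# Sketch of the byte-occurrence check: the signature is the count of
# each of the 256 possible byte values in the file.
from collections import Counter

def byte_histogram(path):
    """Count occurrences of each byte value (0-255) in a file."""
    counts = Counter()
    with open(path, "rb") as f:
        while chunk := f.read(8192):
            counts.update(chunk)
    return tuple(counts.get(b, 0) for b in range(256))

def validate(path, stored):
    """Recompute the histogram and compare it to the stored signature."""
    return byte_histogram(path) == stored
```

An infecting change must then preserve all 256 counts simultaneously.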

Why does everyone assume that a 'database' validation scheme must use only one
type of check? Having a variety of simple checks available and randomly
choosing two for each file to be validated (you store the types of the checks
along with their results) should be enough to stop any virus.
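
The "randomly choose two checks" idea might look like this in Python; the
particular checks and all names are illustrative only:

```python
# A registry of simple, independent checks; a file's signature records
# WHICH two checks were chosen along with their results.
import random
import zlib

CHECKS = {
    "length":  lambda data: len(data),
    "sum32":   lambda data: sum(data) & 0xFFFFFFFF,
    "crc32":   zlib.crc32,
    "adler32": zlib.adler32,
}

def sign(data):
    """Choose two check types at random and store (name, result) pairs."""
    names = random.sample(sorted(CHECKS), 2)
    return [(name, CHECKS[name](data)) for name in names]

def verify(data, signature):
    """Re-run exactly the checks recorded in the signature."""
    return all(CHECKS[name](data) == value for name, value in signature)
```

Since a virus cannot know in advance which pair of checks guards a given
file, it would in effect have to preserve all of them at once.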

So... am I missing something?
/a

--------------------

Date:         Mon, 16 May 88 14:03:00 LCL
Reply-To:     Virus Discussion List <VIRUS-L@LEHIIBM1>
Sender:       Virus Discussion List <VIRUS-L@LEHIIBM1>
From:         Michael Wagner new! +49 228 8199645 <WAGNER@DBNGMD21>
Subject:      Re: software self-checks

> From jim frost <madd@bu-it.bu.edu>
> |> A couple of comments to this.  Yes, it'd slow the spread of
> |> viruses, but it would also make you less paranoid about them (and
> |> thus less likely to catch them),
> |       ----
> |  I assume this should have been MORE likely to catch them?
>
> No, I meant less.  If a virus was built that circumvented the checks,
> you'd probably never find it because you're not looking for it under
> the assumption that if it were there, you'd be told.

  When I catch a virus, that is to me like catching a cold, i.e.
  contracting, getting sick, etc.  That is how I read your comment.

  I think you meant catching as in detecting, trapping, immobilizing,
  etc.

  It seems I understood the opposite meaning from what you intended.
  Ah, the dangers of mixing vocabularies willy nilly from different
  fields.

Michael

--------------------

Date:         Mon, 16 May 88 12:18:00 PDT
Reply-To:     Virus Discussion List <VIRUS-L@LEHIIBM1>
Sender:       Virus Discussion List <VIRUS-L@LEHIIBM1>
From:         WETTERN@OREGON
Subject:      RE: XMAS EXEC add'l comments

I just wanted to let you know that there are at least two legitimate programs
called XMAS EXEC out there.  One of them, for example, is a cute little
animated thing coming from Japan.
So, when next Xmas comes around and somebody sends you an XMAS EXEC, check it
out before you run or discard it. It might be a virus or something legitimate.

--------------------

Date:         Mon, 16 May 88 14:07:00 EDT
Reply-To:     Virus Discussion List <VIRUS-L@LEHIIBM1>
Sender:       Virus Discussion List <VIRUS-L@LEHIIBM1>
From:         Woody <WWEAVER@DREW>
Subject:      detecting damaged data

I think we've become a bit distracted on the problem of virus detection.
There seem to be two essential problems: first, making sure that the program
you receive is free of viruses (and destructive code -- i.e. the authorship
problem) and, second, ensuring that your data does not become contaminated
over time.

The current discussion seems to be revolving about the latter half of the
problem: we would like to ensure that the executable image we are about to
launch (or data base we are about to access) has not been altered by a
malefic agent.  Someone (sorry, threw away that message) proposed storing
a CRC signature for each file.  Jerry Leichter (LEICHTER@YALEVMS) responded

>Unfortunately, it is very easy to modify a file so that any given CRC remains
>unaltered.  The protection you get by using a CRC is thus quite limited - it
>stops those virus-writers who aren't clever enough to work around it.

This is an inherent problem: if we use a simple protection scheme, then the
virus can include code to defeat it.  Varying what is meant as "simple",
while looking at the code to CHECKUP, Kenneth R. van Wyk (LUKEN@LEHIIBM1)
remarked

>This raises an interesting question - is it truly possible to write
>a checksum/crc/whatever algorithm that will be able to figure out
>whether or not a file has been changed 100% of the time?  Let's assume
>that the signature data has *not* been tampered with in any way.  It
>is no secret that both checksums and crcs can be circumvented rather
>easily.  But, an algorithm which could validate a file with 100%
>effectiveness could be very worthwhile looking into.  Once again, the
>validity of the signature data itself would have to be insured via
>encryption and/or read-only isolation.

David M. Chess (CHESS@YKTVMV) gave an argument to answer van Wyk's question
in the negative: essentially, it is that the number of possible messages
is larger than the number of possible (small) signatures.  No one-to-one
function maps a larger set into a smaller set.  However, Leichter's
original post goes on to point out the following:

>I would suggest a compromise:  A mechanism that, while not completely secure,
>makes things very hard on a virus-maker while not requiring huge amounts of
>computation.  When a program is published, a large number of random polynomials
>(say, 100 or 1000) are chosen, using the techniques of Rabin's paper.  The CRC
>of the program with ALL the polynomials is published.  To check a program, you
>chose any two of the polynomials - also published - compute the CRC's, and
>compare.  (You must, of course, chose those two at random.)  The virus maker
>must make his program have the same CRC with respect to ALL the 100 or 1000
>polynomials - which is possible, but requires (I believe, this would need to be
>studied) a LOT of computation.  (The length should probably be checked - it's
>going to be a lot easier on the virus maker if he can add extra stuff to the
>end of the program to make the CRC's come out right.)

Professor Leichter is trying to simultaneously solve the authorship problem and
the preservation of data problem by publishing this list with the program.
One of the problems with this is that we still have to find some way to
publicly distribute this list, and then have to store this list (no small
overhead!).  However, if we just consider trying to stabilize the data, this is
quite a useful system.  When the file is received (or legally modified), a
collection of 100 or 1000 (or 10, site-dependent) CRC polynomial signatures is
chosen and stored separately from the file.

Now, at appropriate intervals, one or two of the CRC polynomials from the
stored list are selected, computed for the file, and compared against the
stored signatures.  If a difference is detected, we know the file is
contaminated and appropriate measures can be taken.
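
The scheme as summarized here might be sketched roughly as follows; a table of
random 32-bit polynomials stands in for the construction in Rabin's paper, and
all function names are invented:

```python
import random

def crc(data, poly, width=32):
    """Bitwise CRC of data under poly (implicit x^width term, no reflection)."""
    top = 1 << (width - 1)
    mask = (1 << width) - 1
    reg = 0
    for byte in data:
        reg ^= byte << (width - 8)
        for _ in range(8):
            reg = ((reg << 1) ^ poly) & mask if reg & top else (reg << 1) & mask
    return reg

def publish(data, n=100, seed=None):
    """Choose n random polynomials and publish every (poly, CRC) pair."""
    rng = random.Random(seed)
    polys = [rng.getrandbits(32) | 1 for _ in range(n)]  # force the x^0 term
    return [(p, crc(data, p)) for p in polys]

def spot_check(data, published, k=2):
    """Verify k entries chosen at random from the published table."""
    return all(crc(data, poly) == expected
               for poly, expected in random.sample(published, k))
```

Checking two of 100 CRCs costs little at validation time, but a virus that
wants to survive everywhere must match all 100 at once.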

In some sense, this is appealing to Arnold Gill's intuitive feel for the
effect of CRC error detection:

>     As I understand CRCs, the CRC is a particular polynomial function
>of the bytes within a file.  If that is the case, would it not be
>possible to devise two (or three) orthonormal CRC polynomials, such that
>any illegal change in the programs constant would necessitate a change
>in one or more of the CRCs?  A virus could make one of the CRCs come
>out all right, but several orthonormal ones?

If the virus knows which (small number of) CRC's are going to be checked, it
(theoretically) can alter the data to preserve all signatures.  If it doesn't
know which CRC's are going to be checked, it can only preserve them randomly.
If we can treat the virus as a random rather than intelligent agent, then
we achieve detection approaching that of noisy telephone lines, for which
CRC's are excellent.
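
The "random rather than intelligent agent" point can be made quantitative; as
a rough illustration (the function name is invented, and we assume the virus
has managed to fix up some number of the published CRCs):

```python
from math import comb

def pass_probability(preserved, total, checked=2):
    """Chance that a spot check of `checked` randomly chosen CRCs, out of
    `total` published ones, happens to hit only the `preserved` ones the
    virus has managed to fix up."""
    if preserved < checked:
        return 0.0
    return comb(preserved, checked) / comb(total, checked)
```

With 100 published polynomials, a virus that preserves only two of them slips
past a two-CRC spot check with probability 1/4950, about 0.02%.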

The chief advantage of this scheme is that we are drawing upon the essence of
zero knowledge proof to verify the identity of the data.  While a virus
could contaminate a specific site (because for a specific site the inquiry is
predetermined: the virus need only defeat a small number of detection
functions) general transmission of the virus is not possible; once such a
virus is detected at one site, avenues of communication (such as this list)
will alert other sites where it may have escaped detection and it can be
killed net-wide.

An important feature of this scheme is that it requires a small amount of
overhead (only a few signatures stored at any one site, instead of all of the
signatures stored at every site) and can be done quickly (little more than
reading the entire file).  It seems clear to me at least that detection
schemes that require extensive storage or processing will not be enacted --
I mean, given sufficient storage, one would simply keep a spare copy of the
data on a WORM -- so this at least has a possibility of being followed in
the interests of good computer hygiene.

--------------------

Date:         Mon, 16 May 88 14:37:41 EDT
Reply-To:     Virus Discussion List <VIRUS-L@LEHIIBM1>
Sender:       Virus Discussion List <VIRUS-L@LEHIIBM1>
From:         riacs!ames!hc!csed-1!csed-47!roskos@rutgers.edu
Subject:      Re:  Signature lengths

>> It is not strictly true that a signature must be as large as the file checked
>> to avoid many-to-one mapping.  An easy way to prove this is data compression.
>
>That's a point; the true statement is more like that the signature of
>a *random* file must be, on average, as large as the file to be checked ...

No, that's not quite right.  All the "counterexample" shows is that there
may be a bijective mapping from the set of all signatures of n bits to some
*subset* (of size 2^n) of the set of all files of m bits, m > n.  As I
mentioned in my previous message to VIRUS-L(1), it is the *number of distinct
values the program file can take on* that determines the size of the signature
required, not the number of bits in the program file per se.  If you
allow the n-bit signature to include a program with more than n bits, there
will be a program of less than or equal to n bits for which there is no
n-bit signature.

A compression algorithm just tries to take the most "frequently occurring"
values of an m-bit file and give them representations in n bits.  Consider
that it is not possible to compress all files, as Mr. Chess points out;
that is because in n bits you can't represent all m bit files, just as
you can't represent all m bit files in an n-bit signature.
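
The counting argument here is just the pigeonhole principle, and it is easy to
exhibit concretely; a toy illustration with an 8-bit checksum over all
possible 2-byte files:

```python
# With 2**16 possible two-byte files and only 2**8 possible one-byte
# signatures, distinct files MUST share signatures.
from collections import defaultdict

def checksum8(data):
    return sum(data) % 256

buckets = defaultdict(list)
for a in range(256):
    for b in range(256):
        buckets[checksum8(bytes((a, b)))].append((a, b))

# Each of the 256 signature values is shared by exactly 256 distinct files,
# so a modification can always be arranged to keep the same signature.
collisions = {sig: len(files) for sig, files in buckets.items()}
```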

In other words, compression would be analogous to finding a function that
assigns a unique n-bit number to each *existing* program you have (where there
are less than 2^n programs in existence).  This may well be possible
(although in the worst case the function might have to have a copy of each
program to compare with the one being tested).  Really, though, the function
would have to do slightly more than that; it would also have to return
some special value for all programs that didn't match, so you can tell
when you have a modified one.
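
The degenerate "copy of each program" case mentioned above amounts to a lookup
table; a trivial sketch (all names invented):

```python
# Assign each existing program a unique small number; anything not in the
# table -- including a modified copy -- gets the special value None.
KNOWN = {}

def register(image):
    """Record a known-good program image and give it the next free number."""
    KNOWN[image] = len(KNOWN)

def identify(image):
    """Return the program's number, or None if it is unknown or modified."""
    return KNOWN.get(image)
```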

>> Another thought for budding compu-epidemiologists;  There is a set of unique
>> machine instructions that a virus must use to work.
>
>Could you elaborate on that?  The instructions that a virus needs to work
>are generally the same instructions that any other program needs to do
>anything else (adds, moves, operating-system calls).

That's true.   It's not really the set of unique machine instructions,
but rather the instruction *sequences*.  Correctly-working programs use
these same instructions (maybe through a mediator, the operating system),
they just don't use them to infect other programs.

However, machines that provide "hardware protection" features do use
exactly the principle stated in the part with the ">>" above; these are
the "privileged instructions" (which may take the form of privileged
system calls).  If you restrict this set of instructions to be used only
by code which is proven to protect against virus-operation (this
includes protecting the protection code itself, and other things generally
associated with a secure system), then you have a system which is "secure,"
which is the source of much work these days.

But, notice that it gets increasingly hard to fully protect a system
in this way; for example, you would have to have the restriction that
only compilers can modify object files; i.e., the system
would have to distinguish writes to a user's file that result from the
user recompiling the program, from writes to the same file that occur
due to some other program trying to modify it to insert a virus.

Now that I think about it, that is not as hard as it first seems; it's
just not something that is usually provided nowadays.  It would require
one to be very rigorous in the design, but it does seem as though
it could be done...

- ----
(1) I haven't seen my previous message appear on the list, though, so it
    may have been lost on the way to LEHIIBM1.

Eric Roskos, IDA (csed-1!roskos@DAITC.ARPA or Roskos@DOCKMASTER.ARPA)

--------------------

Date:         Mon, 16 May 88 15:01:23 EDT
Reply-To:     Virus Discussion List <VIRUS-L@LEHIIBM1>
Sender:       Virus Discussion List <VIRUS-L@LEHIIBM1>
From:         riacs!ames!hc!csed-1!csed-47!roskos@rutgers.edu
Subject:      need RISKS LOG

Is there anyone out there who would be willing to send me the RISKS LOG
on a floppy disk?  I will send you the postage required.  I am preparing
a report on viruses and need the log fairly soon (at least by the first
of next week).  I have been trying for several weeks to get a copy via
GET RISKS LOG but, although I get the report back from the list server
saying the job has run, the log never arrives.  Apparently someone's mailer
is discarding it along the way.

If you are willing to do this, email me about it so we can work out how
to send it.  Thanks... -- Eric Roskos

--------------------

Date:         Mon, 16 May 88 14:47:08 EDT
Reply-To:     Virus Discussion List <VIRUS-L@LEHIIBM1>
Sender:       Virus Discussion List <VIRUS-L@LEHIIBM1>
From:         riacs!ames!hc!csed-1!csed-47!roskos@rutgers.edu
Subject:      Re:  The neverending chain....

        In the future, "hardware" protection is probably going to be a
        necessity.  Hopefully, it won't inhibit system "friendliness" and
        utility too much.  (Remember, the most secure and protected system to
        date is one which is totally isolated and restricted...and who wants
        one of those? Yeeek)

Actually, the goal of a secure system should be to prevent "illegal" accesses
while providing a minimum of interference to "legal" accesses.  This is what
makes it more challenging.  Note, though, that *some* overhead is necessary,
simply because it requires more work to check that an access is "legal" than
just to allow all accesses.  But it doesn't follow that the overhead has
to be intolerably high...
