Issue109

Title: (Kirq) Handle changed datasets
Priority: bug
Status: chatting
Superseder:
Nosy List: christopher, cjr
Assigned To: christopher
Keywords:

Created on 2012-07-30.22:02:36 by cjr, last changed by cjr.

Files
File name                             Uploaded                   Type
fvwm.screenshot.20120802-081937.jpeg  cjr, 2012-08-02.13:29:53   image/jpeg
fvwm.screenshot.20120802-124630.jpeg  cjr, 2012-08-02.18:28:30   image/jpeg
Messages
msg386 Author: cjr Date: 2012-08-02.21:24:29
Would it be better to table this issue entirely for now?  Recognizing
and handling changed data sets isn't a show-stopping issue, and if
you're going to rework the session history structure later anyway,
these issues can be addressed then.
msg385 Author: christopher Date: 2012-08-02.20:25:32
Do you have a simplified list of steps for that bug? I can't reproduce
anything like that.

I think what you are describing is that my solution doesn't work
because it verifies that the files are the same, not that the datasets
are. I was hashing the file like this:

    >>> import md5
    >>> md5.md5(file(filename, 'rb').read()).hexdigest()

The problem is that if we try to hash the dataset as a string
(e.g. str(dataset)), we take a huge performance hit. I tested it with
the Excel sheets and it doesn't like them for some reason.

It would be impossible to keep files from picking up extra characters
that would upset our MD5 even when the data itself is unchanged. We
should just store a reference to the dataset on the signature and
compare those directly -- basically what you suggested.
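Roughly, a sketch of the idea (Signature here is an illustrative
stand-in, not our actual class):

    class Signature(object):
        def __init__(self, dataset, outcome):
            self.dataset = dataset    # keep a reference to the loaded dataset
            self.outcome = outcome

        def matches(self, dataset, outcome):
            # Compare the parsed data structurally, so stray characters
            # in the file (whitespace, encoding quirks) can't change the
            # answer.
            return self.outcome == outcome and self.dataset == dataset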

I'm going to push a new changeset and I want you to try it. It may
have a few kinks to work out, but I think it will do a more complete
job of testing whether two datasets really are the same.

I apologize for being blunt, but I am going to move the larger
discussion to a different mail thread. This issue is already a bit
overwhelming and I want to focus only on my current implementation
here.
msg384 Author: cjr Date: 2012-08-02.18:28:30
Okay -- looking good, but I'm still seeing one bug.  In the attached
screenshot, take a look at the very last session entry (concov_nec_2,
under stokke/A -- f44282b).  This entry is attached to the wrong
dataset/outcome -- signature entry.  The concov table was actually
generated for an analysis with Y as the outcome (not A), so it should
have been attached to the third-to-last dataset/outcome entry
(stokke/Y -- f44282b).

Here's the lineage from that entry:  

Thu, 02 Aug 2012 17:45:03 +0000
Consistency Threshold: 0.9
Coverage Threshold: 0.5
Concov Type: nec
Data Set md5 Checksum: f44282b5315162bef193e9e822dc7bdf
Causal Conditions: ['C', 'S', 'I']
Outcome: Y

But this also raises some other thoughts/questions:

(1) What I didn't realize until looking at this output is that the
signature checksum is providing information that could be useful to
the user.  In the attached screenshot, for example, I can easily tell
that the first two analyses were generated from the same dataset.
Similarly, I can tell that the third-to-last and last analyses were
generated from the same dataset.  What I'm wondering is whether we
should make the dataset the root of the tree, with the outcome as the
first branch.  The first two entries in the attached screenshot would
then look something like:

    - breakdown - 227adb4
        |
        -- Sur
        |    |
        |    -- concov_nec_1
        |
        -- Brk
             |
             -- concov_nec_1

I realize that this might take some effort to implement, so if we
decide that we want to do it, it doesn't have to be done immediately.
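In data-structure terms, I imagine something like this (just a sketch
to show the shape; session_tree and attach_result are hypothetical
names, not Kirq's actual code):

    # One node per dataset signature, outcomes as its children, and the
    # concov/truth-table results filed under each outcome.
    session_tree = {
        '227adb4': {                     # dataset signature (breakdown)
            'Sur': ['concov_nec_1'],
            'Brk': ['concov_nec_1'],
        },
    }

    def attach_result(tree, signature, outcome, result):
        """File a result under dataset -> outcome, creating nodes as needed."""
        tree.setdefault(signature, {}).setdefault(outcome, []).append(result)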

(2) From the lineage window, it looks like you are taking an MD5 hash
of the dataset.  What exactly are you hashing?  The file itself, the
dataset instance, or something else?  The reason that I'm asking is
because in the attached screenshot the "c022cbf" file has the same
content as the "227adb4" file.

What I had done was: I loaded breakdown.xls and ran the first
analysis.  Then I reloaded breakdown.xls (unchanged) and ran the
second analysis.  Both analyses generated the same signature
(227adb4), as I expected.  Then I added a column to breakdown.xls,
loaded it into Kirq, and ran the third analysis.  This generated a new
signature (c5614ec), as I expected.  Then I deleted that new column
from breakdown.xls, loaded it into Kirq, and ran the fourth analysis.
This time a new signature (c022cbf) was generated.  What I'm wondering
is why it didn't generate the earlier signature (227adb4).

Thinking that this may have something to do with reading the data from
Excel, I then tested this same process with stokke.csv.  The first stokke/Y
entry (f44282b) is the original file.  The second stokke/Y entry
(3800bf8) has an additional row.  The final stokke/A entry (f44282b)
has that additional data row deleted.  And I did check the lineage to
see that they definitely have the same full signature (it's not just
that the first 7 characters happen to be the same).

The question I think this raises is: what exactly is the dataset
signature identifying?  The file from which the dataset is read, or
the content of the dataset?  My instinct is that it should be the
content.  (But I haven't thought too much about this yet, so maybe I'm
missing something.)  If the signature identifies the content, this
would suggest (for example) that if a user had the exact same dataset
in two different forms (e.g., as both an Excel and a CSV file), Kirq
would generate the same signature for both of them.
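For what it's worth, here's a sketch of what a content-based signature
might look like (just an illustration; the column_names/rows arguments
are my assumption about how a loaded dataset might be exposed, not
Kirq's actual API).  Hashing a canonical rendering of the parsed data,
rather than the raw file bytes, would make the same table saved as
.xls and as .csv hash identically:

    import hashlib

    def dataset_signature(column_names, rows):
        """Hash the logical content of a dataset, not its file encoding."""
        h = hashlib.md5()
        h.update(','.join(column_names).encode('utf-8'))
        for row in rows:
            # Render each cell to a string; the loader would need to
            # normalize values consistently for this to be stable.
            h.update(('\n' + ','.join(str(c) for c in row)).encode('utf-8'))
        return h.hexdigest()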
msg383 Author: christopher Date: 2012-08-02.15:13:13
Sorry about that. I missed a pretty fundamental if statement in there
:).

Both of these should be fixed. Please test and let me know if there
are more.
msg382 Author: cjr Date: 2012-08-02.13:29:53
I'm seeing a couple of bugs:

Bug 1: I open the breakdown.xls dataset, run a truth table analysis,
change the dataset, and rerun the same analysis.  The new
dataset/outcome -- checksum entry is created correctly.  But the
second truth table is the same as the first one, and the session
window therefore jumps to the first truth table entry.  See attached
screenshot fvwm.screenshot.20120802-081937.jpeg

Bug 2: I open the stokke.csv dataset and run a necessity analysis
with Y as the outcome.  This all works fine.  I then click on the
dataset/outcome -- checksum entry (bringing up the dataset in the main
window) and change my outcome to R (I also had to drop the coverage
threshold to 0.00 but that doesn't impact the bug report).  The
resulting concov table is generated correctly but it's attached to the
previous dataset/outcome -- checksum entry (in this case: stokke/Y --
f44282b).  But a new dataset/outcome -- checksum entry should have
been created (with R as the outcome) and the concov table should be
attached to this new entry.

C.
msg381 Author: christopher Date: 2012-08-01.23:18:44
Ok, I pushed an implementation I want you to try out. Seems to work
pretty well here.

I don't have any good ideas yet for the naming convention, so I'm
using the first 7 characters of the checksum. We can change this as
soon as we have a good idea; I want to test the implementation first.
msg380 Author: cjr Date: 2012-08-01.22:28:38
I don't know.  I was hoping that you might have a good idea!  At a
minimum, just incrementing the entries would probably be okay.
Something like: dataset/Outcome-#

I guess that a form of dataset-#/Outcome would be more accurate, but
I'd worry that the increment would get lost.

Using a timestamp is another possibility, although real estate is
already tight and I don't know that embedding a timestamp would be all
that useful.

Plus, users can always rename the entries themselves if they want.
msg379 Author: christopher Date: 2012-08-01.22:03:12
Ah, I see what you mean. This makes complete sense. Good thinking and
I will implement.

One question: how should we indicate that the new/changed dataset is
different? How would you like to do that?
msg378 Author: cjr Date: 2012-07-31.23:20:36
By "hash the dataset," I was thinking of hashing in the same form as
taking an MD5 hash of a file--different MD5 sums indicate a difference
in the (content of the) files (but don't tell you what the differences
are).  The existing dataset signature may already serve this purpose.

The workflow that I had in mind is a little bit different than your
description:

(1) User opens dataset (e.g., "foo.xls") in Kirq and runs an analysis
(2) User changes foo.xls

At this point, nothing would change within Kirq.  In other words, Kirq
doesn't need to monitor the external dataset for changes.

There are three (I think) possible workflows after the user has
changed foo.xls.  The first two are how Kirq is already designed;
it's the third one that's different.

(3) If the user wants to run another analysis on the unchanged dataset
(i.e., the dataset that was used in step 1) and using the same
outcome, they simply click on the dataset/outcome entry in the session
window and run a new analysis.  Kirq already has this data and doesn't
need to reread foo.xls.  After running the analysis, Kirq would attach
the results (gtt/concov table) underneath the existing
dataset/outcome.  This is how things currently work.

(4) If the user wants to run another analysis on the unchanged dataset
but with a new outcome, they simply click on the dataset/outcome entry
in the session window, specify the new outcome, and run the new
analysis.  Again, Kirq doesn't reread foo.xls.  After running the
analysis, Kirq creates a new dataset/outcome entry in the session
window.  This is how things currently work.

(5) On the other hand, if the user wants to use the changed foo.xls,
they would first select File -> Open Data Set to load the (new) foo.xls.
When they run the new analysis, Kirq would see that it's a new dataset
(even though it has the same name) and therefore create a new
dataset/outcome entry in the session tree.

Does this make sense?  Again, this is just my first shot at thinking
through this issue.  It may be that there's a better solution, or that
it's more complicated than I realize.

Claude
msg377 Author: christopher Date: 2012-07-31.21:37:15
I think I am on the same page as you but I'm having a hard time
understanding a few things. Here is what I think you are saying:

If the user changes a dataset, we reflect the changes immediately in
the current dataset if they are viewing it. On the next analysis, Kirq
creates a new History item, regardless of whether an existing one was
there with the same dataset name/outcome. We basically treat it like a
different dataset.

When we click between them we want to see the dataset revision that
reflects when it was committed to history.

Is this right? It shouldn't be difficult; I'm just having a hard time
understanding what you mean by "hash the dataset".

I created the signature item a while back just for this reason. Maybe
you meant something like that? We should add more data to it so we
know when two datasets are different.

Am I missing anything?
msg376 Author: cjr Date: 2012-07-30.22:02:35
So here's a kinda nasty question: how should Kirq handle situations in which the
dataset file is changed?  Example: start Kirq, load the stokke dataset, and run
an analysis with 'Y' as the outcome.  Now, change the stokke dataset in some
way--e.g., add a column or row, or change an existing value--and open the data set
in Kirq to load it.  If you then run another analysis with the same outcome,
everything's fine.  But when you click on the dataset entry in the session file,
the session loads the original dataset.  Note that if you had instead run the
new analysis with a different outcome, then everything would be fine.

It appears (but correct me if I'm wrong) that Kirq is storing the original
dataset as part of the session file.  I think that this is the correct thing to
do; I think that it would cause all sorts of problems if a Kirq session file was
dependent upon external dataset files.

The question is--how do we deal with situations in which a dataset file is
changed?  I think that there are five main types of changes that could occur: a
row is added or deleted, a column is added or deleted, and/or a value is changed
(this includes changing the name of an observation or column).

It is not at all unusual for a dataset to be changed during the course of an
analysis, so we definitely need to handle it.

The first thought that comes to my mind would be to hash the dataset and create
a new dataset/outcome entry in the session tree if the dataset's been changed. 
We might also want to enumerate the session entry.  I think that this would be
easy enough and transparent enough for users to understand (e.g., "If the
dataset has been changed, a new entry will be created in the session tree."). 
But maybe it's a more difficult problem than that or maybe there's a better way?
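To make that concrete, a rough sketch (hypothetical names, not a
worked-out design): key the session entries by a (signature, outcome)
pair, where the signature is a hash of the loaded dataset, and only
create a new entry when that pair hasn't been seen before:

    def entry_for(session_entries, signature, outcome):
        """session_entries maps (signature, outcome) -> list of analyses."""
        key = (signature, outcome)
        if key not in session_entries:
            session_entries[key] = []   # changed dataset (or new outcome)
        return session_entries[key]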

Any thoughts?
History
Date                 User         Action  Args
2012-08-02 21:24:29  cjr          set     messages: + msg386
2012-08-02 20:25:32  christopher  set     messages: + msg385
2012-08-02 18:28:30  cjr          set     files: + fvwm.screenshot.20120802-124630.jpeg;
                                          messages: + msg384
2012-08-02 15:13:13  christopher  set     messages: + msg383
2012-08-02 13:29:54  cjr          set     files: + fvwm.screenshot.20120802-081937.jpeg;
                                          messages: + msg382
2012-08-01 23:18:44  christopher  set     messages: + msg381
2012-08-01 22:28:38  cjr          set     messages: + msg380
2012-08-01 22:03:12  christopher  set     messages: + msg379
2012-07-31 23:20:37  cjr          set     messages: + msg378
2012-07-31 21:37:15  christopher  set     status: unread -> chatting;
                                          messages: + msg377
2012-07-30 22:02:36  cjr          create