Signal data format support (e.g., BigWig) #56

marcomass · 2017-06-23T09:22:52Z

Enable the use of signal data samples (e.g., BigWig) in some operands, e.g., as the second operand of MAP (or in COVER, to be discussed).
A specific "special" version of the defined MAP operator would probably be better.
Examples of BigWig files (from 0.6 to 1.5 GB) are available at https://www.encodeproject.org/experiments/ENCSR620VIC/

akaitoua · 2017-07-04T15:16:19Z

@marcomass, as I understand it, BigWig is a binary, indexed version of the Wiggle format, and Wiggle is a compressed, less accurate version of BedGraph. Why don't we use BedGraph and always convert BigWig and Wig files to BedGraph?
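To make the proposed conversion concrete, here is a minimal sketch of turning text Wiggle data into BedGraph records. It is not GMQL code: the function name `fixed_step_to_bedgraph` is hypothetical, and it handles only `fixedStep` sections. It relies on two format facts: Wiggle positions are 1-based, while BedGraph intervals are 0-based and half-open. For binary BigWig files, UCSC distributes a `bigWigToBedGraph` utility that does the conversion directly.

```python
def fixed_step_to_bedgraph(lines):
    """Convert fixedStep Wiggle lines to (chrom, start, end, value) tuples.

    Hypothetical helper for illustration; variableStep is not handled.
    Output coordinates are 0-based, half-open, as in BedGraph.
    """
    chrom, pos, step, span = None, None, 1, 1
    out = []
    for line in lines:
        line = line.strip()
        # Skip blanks, comments, and browser/track header lines.
        if not line or line.startswith(("#", "track", "browser")):
            continue
        if line.startswith("fixedStep"):
            fields = dict(f.split("=", 1) for f in line.split()[1:])
            chrom = fields["chrom"]
            pos = int(fields["start"]) - 1      # 1-based -> 0-based
            step = int(fields.get("step", 1))
            span = int(fields.get("span", 1))   # span defaults to 1
            continue
        if chrom is None:
            raise ValueError("data line before a fixedStep declaration")
        out.append((chrom, pos, pos + span, float(line)))
        pos += step
    return out


if __name__ == "__main__":
    wig = ["fixedStep chrom=chr1 start=11 step=10 span=5", "1.5", "2.0"]
    for record in fixed_step_to_bedgraph(wig):
        print(*record, sep="\t")
```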

marcomass · 2017-07-05T10:47:13Z

@akaitoua
You are right that BigWig can be converted to BedGraph (so BigWig is not less accurate than BedGraph). Yet BedGraph takes much more space than BigWig, so nobody uses it and everybody uses BigWig, as in the provided link.

In any case, this issue regards two aspects:

1. the format of the input data (BigWig)
2. the efficiency of the MAP operation when signal data (BigWig or BedGraph) are used as the experiment dataset.

You could postpone aspect 1 and check aspect 2 first (by converting BigWig to BedGraph beforehand); yet I think they are connected, since BigWig possibly allows direct access to specific portions of the file, without scanning it entirely. This would make it possible to read only the portion of the BigWig file that relates to the reference regions in the reference dataset of the MAP, improving performance.
Of course, this requires a new, specific MAP for experiment datasets of signal data, to process this kind of data differently from BED data.
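
The benefit of region-restricted access can be sketched in memory. This is a sketch only: it does not read BigWig, and the names `overlapping` and `map_mean` are hypothetical. It assumes the signal intervals of one chromosome are sorted and non-overlapping, so a binary search finds the intervals touching each reference region without scanning the rest. The coverage-weighted mean is just one possible aggregate, chosen for illustration.

```python
from bisect import bisect_left, bisect_right


def overlapping(starts, ends, values, q_start, q_end):
    """Return signal intervals overlapping the half-open query [q_start, q_end).

    Assumes sorted, non-overlapping intervals, so `starts` and `ends`
    are both increasing and binary search applies.
    """
    lo = bisect_right(ends, q_start)   # first interval with end > q_start
    hi = bisect_left(starts, q_end)    # first interval with start >= q_end
    return [(starts[i], ends[i], values[i]) for i in range(lo, hi)]


def map_mean(starts, ends, values, regions):
    """Coverage-weighted mean signal per reference region (None if uncovered)."""
    result = []
    for q_start, q_end in regions:
        covered = 0
        total = 0.0
        for s, e, v in overlapping(starts, ends, values, q_start, q_end):
            width = min(e, q_end) - max(s, q_start)  # overlap clipped to region
            covered += width
            total += v * width
        result.append(total / covered if covered else None)
    return result


if __name__ == "__main__":
    starts = [0, 10, 20, 30]
    ends = [10, 20, 30, 40]
    values = [1.0, 2.0, 3.0, 4.0]
    print(map_mean(starts, ends, values, [(5, 25), (40, 50)]))
```

Each lookup costs a logarithmic search plus the size of the overlap, which is the property an indexed file format would provide on disk.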

akaitoua · 2017-07-06T16:05:48Z

@marcomass, I checked it, and here are more details. I suggest supporting only the BedGraph format, since it does not change our data model. Whenever we copy data into GMQL, we convert it to what we call GMQL_WIG: BedGraph in a columnar, binary format that GMQL can read and that is small in size compared to BedGraph. Then we store GMQL_WIG in our repository.
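
The thread does not specify the actual GMQL_WIG layout, so the following only illustrates what "columnar" means in general: one typed array per BedGraph column instead of one text line per record. The function names, the dictionary encoding of chromosome names, and the chosen array type codes are all assumptions made for this sketch.

```python
from array import array


def to_columns(rows):
    """Turn BedGraph-style rows into one typed array per column (illustrative)."""
    chrom_ids = {}  # chromosome name -> small integer code
    chroms = array("H")   # unsigned short codes
    starts = array("I")   # unsigned int positions
    ends = array("I")
    values = array("d")   # double-precision signal values
    for chrom, start, end, value in rows:
        chroms.append(chrom_ids.setdefault(chrom, len(chrom_ids)))
        starts.append(start)
        ends.append(end)
        values.append(value)
    names = sorted(chrom_ids, key=chrom_ids.get)  # names ordered by code
    return names, chroms, starts, ends, values


def to_rows(names, chroms, starts, ends, values):
    """Rebuild the row-oriented records from the columns."""
    return [(names[c], s, e, v)
            for c, s, e, v in zip(chroms, starts, ends, values)]


if __name__ == "__main__":
    rows = [("chr1", 0, 10, 1.5), ("chr1", 10, 20, 0.25), ("chr2", 5, 9, 3.0)]
    columns = to_columns(rows)
    print(to_rows(*columns) == rows)
```

Each array can be written as raw bytes with `array.tobytes()`, which avoids repeating chromosome names and decimal digits as text for every record.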

The reason not to use BigWig and Wig in GMQL is that we perform different types of queries than others in the field. We always perform a full join between the reference and the experiment (the set of regions in the reference is almost equal in size to the experiment sample). If we start supporting interval joins (which is like selecting a small portion of the BigWig file), then it would be better to change GMQL to use an index, which will be faster in that case.
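
The full-join case can be illustrated with a merge-style scan. This is a sketch under stated assumptions, not the GMQL implementation: the function name `merge_map_count` is hypothetical, both inputs belong to one chromosome and are sorted by start, and neither input contains overlapping intervals. When reference and experiment are of similar size, one sequential pass over both touches every record about once, so an index brings little benefit.

```python
def merge_map_count(regions, signal):
    """Count signal intervals overlapping each reference region in one pass.

    Both arguments are lists of half-open (start, end) pairs, sorted by
    start and free of internal overlaps (assumptions of this sketch).
    """
    counts = []
    j = 0  # index of the first signal interval that may still overlap
    for r_start, r_end in regions:
        # Drop signal intervals that end before this region begins; they
        # cannot overlap any later region either, because regions are sorted.
        while j < len(signal) and signal[j][1] <= r_start:
            j += 1
        # Count intervals that start before the region ends.
        k = j
        n = 0
        while k < len(signal) and signal[k][0] < r_end:
            n += 1
            k += 1
        counts.append(n)
    return counts


if __name__ == "__main__":
    regions = [(0, 10), (10, 30), (50, 60)]
    signal = [(0, 5), (5, 12), (12, 20), (25, 35), (40, 45)]
    print(merge_map_count(regions, signal))
```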

marcomass · 2017-07-06T16:26:22Z

@akaitoua
OK. How can we change the format to GMQL_WIG when copying data into GMQL?
What has to be the input format of this transformation? BedGraph?

Do you think that using BedGraph (or GMQL_WIG) as the experiment dataset of a MAP that uses genes as reference regions in the reference dataset (thus about 25,000 for human) would be handled by the current system with reasonable performance?

marcomass added this to the Version 2.3 milestone Jun 23, 2017

marcomass assigned akaitoua Jun 23, 2017

akaitoua removed their assignment Aug 30, 2017

akaitoua added the enhancement label Sep 4, 2017

marcomass modified the milestones: Version 2.4, Version 2.3 Sep 12, 2017

akaitoua added the help wanted label Sep 12, 2017
