Signal data format support (e.g., BigWig) · Issue #56 · DEIB-GECO/GMQL · GitHub

marcomass opened this issue Jun 23, 2017 · 4 comments
Open


@marcomass
Contributor

Enable the use of signal data samples (e.g., BigWig) in some operators, e.g., as the second operand of MAP (or in COVER, to be discussed).
Possibly, or even probably, a specific "special" version of the existing MAP operator would be better.
Examples of BigWig files (from 0.6 to 1.5 GB) are available at https://www.encodeproject.org/experiments/ENCSR620VIC/

@marcomass marcomass added this to the Version 2.3 milestone Jun 23, 2017
@akaitoua
Contributor

akaitoua commented Jul 4, 2017

@marcomass, as I understand it, BigWig is a binary, indexed version of the Wiggle format, and Wiggle is a compressed, less accurate version of BedGraph. Why don't we use BedGraph and always convert BigWig and Wig files to BedGraph?
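A minimal sketch of the text-level conversion suggested here, limited to `fixedStep` Wiggle input. The function name is hypothetical and this is not GMQL code; it only relies on the fact that Wiggle positions are 1-based while BedGraph uses 0-based, half-open intervals. Converting BigWig would first require a BigWig reader, which is not shown.

```python
def fixed_step_wig_to_bedgraph(lines):
    """Convert fixedStep Wiggle lines to (chrom, start, end, value) BedGraph rows."""
    chrom, pos, step, span = None, 0, 1, 1
    rows = []
    for raw in lines:
        line = raw.strip()
        # Skip blanks, comments, and browser/track header lines.
        if not line or line.startswith(("#", "track", "browser")):
            continue
        if line.startswith("fixedStep"):
            fields = dict(tok.split("=", 1) for tok in line.split()[1:])
            chrom = fields["chrom"]
            pos = int(fields["start"]) - 1      # 1-based Wiggle -> 0-based BedGraph
            step = int(fields.get("step", 1))
            span = int(fields.get("span", 1))
            continue
        if chrom is None:
            raise ValueError("data line found before any fixedStep declaration")
        rows.append((chrom, pos, pos + span, float(line)))
        pos += step
    return rows


wig = ["fixedStep chrom=chr1 start=101 step=10 span=5", "1.5", "2.0", "0.0"]
rows = fixed_step_wig_to_bedgraph(wig)
for chrom, start, end, value in rows:
    print(chrom, start, end, value, sep="\t")
```

Every data value becomes one explicit BedGraph row, which is part of why the BedGraph text grows larger than the Wiggle input.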

@marcomass
Contributor Author

@akaitoua
You are right that BigWig can be converted to BedGraph (so BigWig is not less accurate than BedGraph). Yet BedGraph takes much more space than BigWig, so nobody uses it and everyone uses BigWig, as in the provided link.

In any case, this issue regards two aspects:

  1. the format of the input data (BigWig)
  2. the efficiency of the MAP operation when signal data (BigWig or BedGraph) are used as the experiment dataset.
    You could postpone aspect 1 and check aspect 2 first (by converting BigWig to BedGraph beforehand); yet I think they are connected, since BigWig possibly allows accessing specific portions of the file directly, without scanning it entirely. This would make it possible to read only the portions of the BigWig file that correspond to the regions in the reference dataset of the MAP, improving performance.
    Of course, this requires a new, specific MAP for experiment datasets of signal data, which processes this kind of data differently from BED data.
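The idea of reading only the portion of the signal that overlaps each reference region can be sketched as follows. This is an in-memory illustration, not a BigWig reader and not GMQL code: the `SignalTrack` class, its `mean_over` method, and the choice of a length-weighted mean as the MAP aggregate are all assumptions. It also assumes the signal intervals do not overlap each other, as in BedGraph.

```python
from bisect import bisect_right


class SignalTrack:
    """Per-chromosome sorted signal intervals with binary-search lookup."""

    def __init__(self, rows):
        # rows: (chrom, start, end, value), 0-based half-open, non-overlapping
        self.by_chrom = {}
        for chrom, start, end, value in sorted(rows):
            starts, ends, values = self.by_chrom.setdefault(chrom, ([], [], []))
            starts.append(start)
            ends.append(end)
            values.append(value)

    def mean_over(self, chrom, start, end):
        """Length-weighted mean of the signal covering [start, end), or None."""
        if chrom not in self.by_chrom:
            return None
        starts, ends, values = self.by_chrom[chrom]
        # Non-overlapping sorted intervals have sorted ends, so binary search
        # finds the first interval ending after the region start.
        i = bisect_right(ends, start)
        total, covered = 0.0, 0
        while i < len(starts) and starts[i] < end:
            overlap = min(ends[i], end) - max(starts[i], start)
            total += values[i] * overlap
            covered += overlap
            i += 1
        return total / covered if covered else None


track = SignalTrack([
    ("chr1", 0, 10, 1.0), ("chr1", 10, 20, 3.0), ("chr1", 30, 40, 5.0),
    ("chr2", 0, 5, 2.0),
])
genes = [("geneA", "chr1", 5, 15), ("geneB", "chr1", 20, 30), ("geneC", "chr1", 35, 100)]
mapped = {name: track.mean_over(chrom, s, e) for name, chrom, s, e in genes}
print(mapped)
```

Each lookup touches only the intervals that overlap the reference region, which is the access pattern an indexed format would make possible on disk.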

@akaitoua
Contributor

akaitoua commented Jul 6, 2017

@marcomass, I checked it, and here are more details. I suggest supporting only the BedGraph format, since it does not change our data model. Whenever we copy data into GMQL, we change the format to what we call GMQL_WIG: a BEDGraph in columnar format, which is binary, readable by GMQL, and small in size in comparison to BEDGraph. Then we store GMQL_WIG in our repository.
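To illustrate what "BEDGraph in columnar format" could mean, here is a minimal sketch. The real GMQL_WIG layout is not described in this thread, so the column names, the type codes, and the dictionary encoding of chromosome names below are all hypothetical.

```python
from array import array

# Hypothetical column layout: column name -> array type code.
COLUMNS = {"chrom": "i", "start": "q", "end": "q", "value": "d"}


def to_columnar(rows):
    """Split BedGraph rows into one typed array per column."""
    chrom_names, chrom_ids = [], {}
    table = {name: array(code) for name, code in COLUMNS.items()}
    for chrom, start, end, value in rows:
        if chrom not in chrom_ids:          # dictionary-encode chromosome names
            chrom_ids[chrom] = len(chrom_names)
            chrom_names.append(chrom)
        table["chrom"].append(chrom_ids[chrom])
        table["start"].append(start)
        table["end"].append(end)
        table["value"].append(value)
    return chrom_names, table


def from_columnar(chrom_names, table):
    """Rebuild BedGraph rows from the column arrays."""
    cols = zip(table["chrom"], table["start"], table["end"], table["value"])
    return [(chrom_names[c], s, e, v) for c, s, e, v in cols]


def to_bytes(table):
    """Serialize each column to a raw binary blob."""
    return {name: col.tobytes() for name, col in table.items()}


def from_bytes(blobs):
    """Restore the column arrays from their binary blobs."""
    table = {}
    for name, code in COLUMNS.items():
        col = array(code)
        col.frombytes(blobs[name])
        table[name] = col
    return table


rows = [("chr1", 100, 105, 1.5), ("chr1", 110, 115, 2.0), ("chr2", 0, 50, 0.25)]
chrom_names, table = to_columnar(rows)
restored = from_columnar(chrom_names, from_bytes(to_bytes(table)))
print(restored == rows)
```

Keeping each column contiguous means a query that needs only coordinates, or only values, can read just those arrays, and uniform columns tend to compress well; whether GMQL_WIG exploits either property is not stated here.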

The reason for not using BIGWIG and WIG in GMQL is that we perform a different type of query than others in the field. We always perform a full join between the reference and the experiment (the set of regions in the reference is almost equal in size to the experiment sample). If we start supporting interval joins (which are like selecting a small portion of the BIGWIG file), then it is better to change GMQL to use an index, which will be faster in this case.
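The full-join case, where the reference is about as large as the experiment, can be sketched as a single sorted sweep over both inputs with no index at all. This is an illustration under assumptions, not the GMQL implementation: `map_sweep` is a hypothetical name, the aggregate is a length-weighted mean, and the signal intervals are assumed not to overlap each other.

```python
def map_sweep(references, signal):
    """One-pass MAP over sorted inputs.

    references: (name, chrom, start, end); may overlap each other.
    signal: (chrom, start, end, value); 0-based half-open, non-overlapping.
    """
    refs = sorted(references, key=lambda r: (r[1], r[2]))
    sig = sorted(signal)
    result = {}
    lo = 0  # first signal interval that can still overlap the current reference
    for name, chrom, start, end in refs:
        # Drop signal intervals that end before this reference starts.
        # This pointer never moves backwards because both inputs are sorted.
        while lo < len(sig) and (sig[lo][0], sig[lo][2]) <= (chrom, start):
            lo += 1
        total, covered = 0.0, 0
        i = lo
        while i < len(sig) and sig[i][0] == chrom and sig[i][1] < end:
            overlap = min(sig[i][2], end) - max(sig[i][1], start)
            total += sig[i][3] * overlap
            covered += overlap
            i += 1
        result[name] = total / covered if covered else None
    return result


signal = [
    ("chr1", 0, 10, 1.0), ("chr1", 10, 20, 3.0), ("chr1", 30, 40, 5.0),
    ("chr2", 0, 5, 2.0),
]
references = [
    ("geneA", "chr1", 5, 15), ("geneB", "chr1", 20, 30),
    ("geneC", "chr1", 35, 100), ("geneD", "chr2", 0, 5),
]
mapped = map_sweep(references, signal)
print(mapped)
```

When the references cover most of the genome, the sweep reads nearly the whole signal anyway, so an index saves little; with a handful of small reference regions, binary-search or indexed access skips most of the data instead.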

@marcomass
Contributor Author

@akaitoua
Ok. How can we change the format to GMQL_WIG when copying data into GMQL?
What must the input format of this transformation be? BEDGraph?

Do you think that using BEDGraph (or GMQL_WIG) as the experiment dataset of a MAP that uses genes as the regions of the reference dataset (thus, about 25,000 for human) would be handled by the current system with reasonable performance?

@akaitoua akaitoua removed their assignment Aug 30, 2017
@marcomass marcomass modified the milestones: Version 2.4, Version 2.3 Sep 12, 2017
2 participants