Have you ever had two #chromosome 9's? Well today I have.

One more reason that I prefer #R Dataframes to #Python Dataframes (#pandas). In R, there is rarely any uncertainty when it comes to loading in genomic data.

Image1 shows a 140k row table generated with Pandas containing just "9" or "X" for the chromosome.

Image2 shows how that dataset is read easily by R, but misinterpreted by pandas (which ends up holding both an integer 9 and a string "9" in the same column) unless you set the datatypes yourself.
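For anyone wanting to see the effect without the 140k-row file, here is a minimal sketch. The column names `chrom` and `pos` and the tiny in-memory table are made up for illustration. It shows that an object column can hold integer 9 and string "9" as distinct values, and that declaring the dtype at read time avoids this.

```python
import io
import pandas as pd

# Hypothetical two-column table: chromosome label and a position.
rows = ["chrom\tpos"] + [f"{c}\t{i}" for i, c in enumerate(["9", "X", "9", "9"])]
tsv = "\n".join(rows)

# Declaring the dtype up front keeps every chromosome label a string,
# no matter how the file is chunked during parsing.
df = pd.read_csv(io.StringIO(tsv), sep="\t", dtype={"chrom": str})
labels = sorted(df["chrom"].unique())

# What a mixed column looks like: integer 9 and string "9" count as
# two different values, hence "two chromosome 9's".
mixed = pd.Series([9, "9", "X"], dtype=object)
n_mixed = mixed.nunique()

# Casting everything to str collapses them back into one label.
n_fixed = mixed.astype(str).nunique()

print(labels, n_mixed, n_fixed)
```

On a large real file the mixed column typically appears when `read_csv` infers types chunk by chunk, which is what the mixed-types warning is reporting.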

I've heard of #chromosome_duplication but this is pushing it

@mtekman well, the column has mixed types [Int, Str]. This is what Python tells you. Isn't it preferable to get a warning instead of having values cast internally? Is explicit better than implicit, or not? :)

@bgruening I'm of the opinion that a series / vector should be of a single type (but I guess that's another issue :D). I think the warning should be an error in some cases:

https://github.com/pyranges/pyranges/issues/375

PyRanges read_bed produces wrong number of chromosomes when cast to categorical · Issue #375 · pyranges/pyranges

First referenced here: sims-lab/CapCruncher#234 (comment) Content posted below: Create Genomic Data import pandas as pd import numpy as np ## Create a dataframe with large number of rows n = 150000...
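One way to get the "error instead of warning" behaviour today is sketched below. `read_strict` and `check_single_type` are hypothetical helper names, not part of pandas or PyRanges. The wrapper promotes pandas' `DtypeWarning` to an exception and then also rejects any column that holds more than one Python type.

```python
import io
import warnings
import pandas as pd


def check_single_type(series: pd.Series) -> None:
    """Raise TypeError if a column holds more than one Python type."""
    kinds = {type(v).__name__ for v in series.dropna()}
    if len(kinds) > 1:
        raise TypeError(f"column {series.name!r} mixes types: {sorted(kinds)}")


def read_strict(buf, **kwargs) -> pd.DataFrame:
    """Hypothetical wrapper: like read_csv, but mixed-type columns are fatal."""
    with warnings.catch_warnings():
        # Promote the mixed-types warning to an exception while reading.
        warnings.simplefilter("error", category=pd.errors.DtypeWarning)
        df = pd.read_csv(buf, **kwargs)
    # Belt and braces: verify each column after loading as well.
    for name in df.columns:
        check_single_type(df[name])
    return df


# A small, clean table loads without complaint.
ok = read_strict(io.StringIO("chrom\n9\nX\n"))

# A column mixing int and str is rejected.
try:
    check_single_type(pd.Series([9, "9", "X"], name="chrom"))
    rejected = False
except TypeError:
    rejected = True
```

The per-column check is slow on very large tables, so passing an explicit `dtype` to `read_csv` is probably the more practical fix.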