| Data Overview (with Disclaimer) |
| Written by Ken McNally |
| Tuesday, 07 October 2008 10:12 |
This is release 1 of the OryzaSNP database. Currently, only Perlegen model-based SNP predictions are annotated, but we anticipate that in the near future we will be able to add predictions based on machine-learning methods (Clark et al. 2007, Science 317:338; see References below).

As to experimental design, Perlegen designed 25-mer oligos with single-base offsets that were tiled across the 100 Mb fraction of the genome for both strands. The 13th base of each oligo was fully degenerate, so that each target position in the reference sequence was interrogated by 8 oligos (4 possible bases on each of the 2 strands). Independent long-range PCR (LR-PCR) amplicons were produced for target pools across the regions arrayed on a particular wafer or chip. These LR-PCR amplicons were pooled, labeled, and hybridized to the wafers.

Because of PCR failure, sequence complexity of the target, and/or repetitiveness (significant similarity across the 25-mer features), not all arrayed features give assays of consistent quality and uniformity. Like Sanger dideoxy sequencing, the Perlegen hybridization-based sequencing method yields base calls of varying quality. Before using the data, you should therefore be aware that a quality value is associated with each base call derived from the Perlegen data, and, whenever possible, you should select the SNP sites with the highest quality values for your work. Assessment of data quality is ongoing, and we encourage you to check the OryzaSNP website frequently for new analyses of data quality or improvements.

We are applying the machine-learning approach at MPI-Tubingen (Clark et al. 2007) to identify polymorphisms that are not recognized by the Perlegen model-based algorithm. For the machine learning, we have produced dideoxy sequences across a sampling of tiled regions for each of the 20 varieties.
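The tiling design described above (25-mers at single-base offsets, with the 13th base varied over all four nucleotides on both strands) can be illustrated with a minimal sketch. This is not Perlegen's actual probe-design software: the function names, the toy reference sequence, and the convention of writing probes in the same orientation as the strand they represent are all assumptions made for illustration only.

```python
# Illustrative sketch only: names, the toy sequence, and the strand
# orientation convention are hypothetical, not taken from OryzaSNP.

PROBE_LENGTH = 25
CENTER = 12  # zero-based index of the 13th base in a 25-mer

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def probes_for_position(reference, pos):
    """Build the 8 probes (4 bases x 2 strands) centered on reference[pos].

    Each tuple is (strand, forward-strand base at the center, probe sequence).
    """
    start = pos - CENTER
    end = start + PROBE_LENGTH
    if start < 0 or end > len(reference):
        raise ValueError("position too close to a sequence end for a full 25-mer")
    window = reference[start:end]
    probes = []
    for base in "ACGT":
        # Substitute each possible base at the 13th (central) position.
        forward = window[:CENTER] + base + window[CENTER + 1:]
        probes.append(("forward", base, forward))
        # The center of a 25-mer stays at index 12 after reversal, so the
        # reverse-strand probe carries the complement of `base` at its center.
        probes.append(("reverse", base, reverse_complement(forward)))
    return probes


def tile(reference):
    """Yield (position, probes) at single-base offsets along the reference."""
    for pos in range(CENTER, len(reference) - CENTER):
        yield pos, probes_for_position(reference, pos)


if __name__ == "__main__":
    toy_reference = "ACGT" * 10  # hypothetical 40-base sequence
    for strand, base, seq in probes_for_position(toy_reference, 12):
        print(strand, base, seq)
```

Under these assumptions, one of the four forward probes always matches the reference exactly, and a SNP would show up as stronger hybridization to one of the other three. Real base calling also has to account for the quality issues described above.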
These dideoxy sequence data will be used to assess the false discovery rate and true positive rate of SNP detection, and they will also serve as the training set for the machine-learning analysis. The intersection of the Perlegen model-based calls and the MPI machine-learning calls will give the set with the highest confidence. The machine-learning calls will be annotated in a subsequent release of the database.

References

Clark RM, Schweikert G, Toomajian C, Ossowski S, Zeller G, Shinn P, Warthmann N, Hu TT, Fu G, Hinds DA, Chen H, Frazer KA, Huson DH, Schölkopf B, Nordborg M, Rätsch G, Ecker JR, Weigel D. (2007) Common sequence polymorphisms shaping genetic diversity in Arabidopsis thaliana. Science 317(5836):338-42.
| Last Updated ( Wednesday, 29 October 2008 09:58 ) |




