Polish Y-DNA R1a Clades

16-Jan-10

Peter Gwozdz

pete2g2@comcast.net

 

           If this is your first time here, consider jumping down to the Abstract for a summary of this web document.

           My methods and results have been recently published.

           The Polish Project has my assignments of men to types as a subdivision of haplogroup R1a, which is a category of Y-DNA.

           This web document is for explanation, details, and update news.

           The Results Table has a summary of assignments.

           For more explanation of R1a subdivision based on STR correlations, see www.gwozdz.org/R1a.html.

 

M458 News

 

           The big news is the M458 data.  P type and N type are coming out M458+, which means R1a1a7.  For details, see the M458 News topic at the top of my R1a page.  On the Polish Project Y-DNA page, I’m using the categories P Borderline and N Borderline to highlight the samples most likely to benefit from the M458 SNP test.

 

Polish Project R1a Assignment News

 

           The Polish Project has 87 results for this new SNP test (on 7 Jan).  All results are consistent with the R1a assignments I have been making for the Polish project.  Check out the topic New Information below for an explanation of the new M458 SNP.

           P type and N type.  These and only these are coming out M458+.  That means if you have all 67 markers and match either P or N you are likely M458+, which means R1a1a7.  If you do not match P or N at 67 markers you are likely M458-, which means R1a1a*.

           Let’s call that a new “P&N rule”.  It has better than 90% confidence overall for predicting M458 results.

           P Borderline & N Borderline:  However, there is low confidence for the 13 samples assigned on the Polish Project Page to “P Borderline” or “N Borderline”.  (A “sample” is the STR data for one man, a member of the Polish Project.)  These samples would benefit from the M458 test, because they are most likely to be rare exceptions to the new P&N rule.  These are only 8% of the untested samples with 67 markers, and they are likely to be M458+ based on the 6 samples like this tested so far, but confidence is low (unknown percent confidence) because 6 is very few results and because some of these have unique STR profiles.

           K Borderline:  These are likely to be M458- but confidence is roughly 80% because there are not many M458 results yet.  In addition, these have an unknown probability of hiding an undiscovered clade of M458+ that does not match P or N.  The confidence is higher for those 3 or 4 steps from the K modal haplotype, and lower for those at 5 or 6 steps.  (Samples at fewer than 3 steps are assigned to K type and those at more than 6 steps are U67.)  As more samples in K Borderline test for M458 we’ll get higher confidence for the M458- prediction.  Reminder:  K Borderline and U67 are categories, not statistically valid types:  K Borderline is the main R1a tree trunk, the largest category at 67 markers, made up of samples that have not yet been classified as types because the STR values are close to a uniform distribution;  U67 are the Unassigned samples, distant from the R1a tree trunk.
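
           The step thresholds in the parenthesis above can be sketched in a few lines of Python.  This is an illustration only, not the actual rule sheet used for the Polish Project:  the function name is hypothetical, and the sketch assumes the sample has all 67 markers and has already failed to match P, N, or any other type.

```python
def classify_by_k_step(step: int) -> str:
    """Map the step (genetic distance) from the K modal haplotype to a category.

    Thresholds follow the text: fewer than 3 steps is K type, 3 to 6 steps
    is K Borderline, more than 6 steps is U67. Matching against P, N, and
    the other types is assumed to have been done before this check.
    """
    if step < 0:
        raise ValueError("step must be non-negative")
    if step < 3:
        return "K"
    if step <= 6:
        return "K Borderline"
    return "U67"


for example_step in (1, 3, 6, 7):
    print(example_step, classify_by_k_step(example_step))
```

           The sketch ignores the subtypes of K and the R- category, which depend on more than the step from the K modal haplotype.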

           U67:  Confidence is roughly 80% M458- for the same reasons as K Borderline.  As these samples test out M458-, they get moved to the R- category, which is a Remainder category.

           P, N, K, A, B, C, I:  For samples that closely match one of the types, confidence is better than 95% in that new P&N rule for predicting M458 results, based on M458 results so far.  I hesitate to recommend the M458 test for samples with a low chance of learning new information, but there may be a surprise waiting to be discovered:  As more of these “close matches” continue to test for M458, we may learn that most of the types have 99% confidence except maybe one type that may turn out something like 80% confidence due to a “hiding” clade from the opposite group with similar STR values by luck.  Maybe not.  Time will tell.

           For samples with less than 67 markers, I expect buying more markers to provide better information than buying the M458 test, because the 67 markers can predict subtypes within the M458+ (R1a1a7) and M458- (R1a1a*).  An exception would be those few samples at 25 or 37 markers that are on the STR border of P or N, where the M458 test can determine the status with confidence.

 

Fall Issue of JoGG

 

           The Fall issue of the Journal of Genetic Genealogy came out on 21 Nov.  My publication is split into two parts there:

           Part I is my “mountains in haplospace” method for evidence that certain “types” of STR clusters correspond to clades.

           Part II is the application of that method to Common Polish Clades.  That article has a lot more detail than this web page, but that article was last updated in September, so this web page is an update.

           PolishCladesUpdate is my folder for future updates to those two articles.

           This web page will continue as an introduction and summary, without as much jargon and detail as the articles and update folder.

 

R1a Worldwide

 

           On 14 Nov, I moved the general discussion of R1a to a new web page, www.gwozdz.org/R1a.html.

           That new page has the recent test results details for M458.

           This page continues to emphasize Polish data.

 

New Information;  Underhill

 

           A new article was published online, 4 Nov, essentially dividing R1a1 into two groups, based on a new SNP, M458.

                       Article links:  Abstract;  STR Data.  See www.gwozdz.org/R1a.html for more discussion.

           I call this article “Underhill” for short, because his is the lead name in the list of 34 authors for this major work.

           This web page about Polish Clades has been completely rewritten using this new information.  Recent test results for M458 are consistent with (albeit not full proof of) my previous R1a subdivision into “types” here on this web page about Polish Clades.

           Briefly, most of R1a1a is split by this new mutation into R1a1a7 (M458 positive, or M458+) and R1a1a* (M458-).  See R1a Subdivision for a brief summary of other groups, and for a clarification of what I mean by R1a1a*.

           R1a1a7 is the new M458 haplogroup.  R1a1a7 includes what I have been calling P type and N type here on the web page, even before M458 was available.

           R1a1a* is a new paragroup.  This is M458 negative.  It includes all my other types, particularly K type.

           This Underhill article has data for 158 “Poland” samples (Table 2):

                       R1a1a*:           71 samples      44.9%

                       R1a1a7:           87 samples      55.1%

           The 70% confidence interval for R1a1a7 is about 50% to 60%.
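
           For readers who want to check that interval, here is a minimal sketch.  It assumes a simple normal approximation to the binomial distribution, which may not be the method actually used for the figure above, and the function name is hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def normal_interval(successes: int, total: int, confidence: float) -> tuple[float, float]:
    """Two-sided confidence interval for a proportion (normal approximation)."""
    p = successes / total
    # z value that leaves (1 - confidence) / 2 in each tail
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    margin = z * sqrt(p * (1 - p) / total)
    return p - margin, p + margin


# 87 R1a1a7 samples out of 158 "Poland" samples (Underhill, Table 2)
low, high = normal_interval(87, 158, 0.70)
print(f"{low:.1%} to {high:.1%}")
```

           Under that assumption the interval comes out at roughly 51% to 59%, in line with the "about 50% to 60%" quoted above.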

           Worldwide 77% of the Underhill data is R1a1a* (222 in Table 7 vs 68 R1a1a7, 290 total).

           M458 Results are coming in now for the new SNP test and the Polish Project R1a is splitting about evenly, with a few percent more R1a1a7 than R1a1a*, although the latter is more common worldwide.

           7 Jan results:  120 samples are 62 R1a1a* (52%) vs 58 R1a1a7, but the results available to me are biased toward Poland.  Within the Polish Project there are 87 samples, 45 R1a1a7 (52%) vs 42 R1a1a*.

 

Abstract

 

           This Abstract is for people reasonably familiar with the jargon of genetic genealogy.  If you are new to genetic genealogy you might prefer to read the Introduction first.

           This web document has three purposes:  1. More detailed explanations for the men (samples) that I assign to types in the Polish Project.  2. Summary of my published results.  3. Update with recent results.

           The topic is common Polish Y-DNA clades - identification of male line Y-DNA clusters that are concentrated in Poland.

           Since I originally posted this in December 2007, emphasis has been haplogroup R1a, because about half of Polish men are R1a, with no subdivision into smaller haplogroups.  A new division, roughly 50-50, between R1a1a* and R1a1a7, became available in November.  I will soon expand this page to include clades from other haplogroups that seem to be concentrated in Poland.  In December 2009 I rewrote and moved my general R1a analysis to another web document.

           I use the word type to mean an STR cluster with statistical validity as established by my Mountain Method.  I expect my types to be validated some day by discovery of new SNPs that will qualify them as haplogroups.  I chose the word “type” because it is not generally used in genetic genealogy and I wish to distinguish my types from haplogroups and from other clusters.  All types have associated clusters but not all clusters qualify as types.  In my publications and web pages I make it clear which types I have discovered in web data and which types were suggested to me by others, with references.  Sometimes I discover a type and later find out someone else had mentioned it earlier on the web;  let me know if you the reader have more clues and references for me.

           Most of the types that I have identified seem to be 1,000 to 5,000 years old, so all the men in each type seem to be descended in direct male lines from one man (MRCA) who lived that long ago (TMRCA).  A few of my types might be younger or older than that range.

           I use phrases like “seem to be” over and over because my methods are statistical.  On the Polish Project Y-DNA Results page, I assign samples with at least 80% statistical confidence, which means most assignments have better than 80% probability of being correct.  A few assignments are up to 99% probability.  About 95% of the assignments are correct in my opinion.

           Because of the restriction to 80% confidence, most men in the Polish Project are not assigned to types on the Polish Project web page.  I provide an Excel File assigning all men, with lower confidence.  That file also has more detailed notes about the assignment method.

           I divided the R1a Polish data into 4 clusters based on STR data.  About half the men of Polish male line ancestry belong to the R1a haplogroup, and that group divides roughly equally into these 4 clusters.  I call them P type, N type, K type, and R.  Only P and N are in R1a1a7.

           R, Remainder, is not a type.  I use R for samples where I have 80% confidence that they do not belong to K type or to one of the rare types other than K that I have identified in R1a1a* so far.

           Borderline clusters are not types but seem close to types.  Each Borderline cluster has discussion below.  I use borderline clusters for samples with all 67 standard markers, where the confidence of assignment to the corresponding type is lower than 80%, but the samples are close enough to deserve a separate category.

           U, Unassigned, is my assignment for samples without confident assignment and with less than the full 67 standard markers.  This is the largest category in the Polish Project;  my comment about 4 equal categories refers to samples with the full 67, taken as representative of ethnic Polish men, with caveats explained in my publication.  I am temporarily using a U67 category for samples with 67 markers that may benefit from the new M458 test;  these will be reclassified as R when the bounds of M458 are established.

           I have 99% overall confidence that P and N correspond to clades, valid subdivisions of R1a1a7.  Individual samples have variable confidence greater than 80% due to the remote chance of unusual outliers from other types.  K type is R1a1a but not R1a1a7.  My overall confidence in K type is only 85% because there seem to be unidentified types with STR values close to K.  The modal haplotype for K is essentially the same as the modal haplotype for all of R1a.  However, I have identified subtypes of K that have much higher confidence.  In other words, in this case I have higher confidence for some individual samples.  I have high confidence in the subtypes although I am not sure all the subtypes belong to exactly the same clade along with all the other samples that I have assigned to K outside the subtypes.  Even if K is not a true clade as defined, however, it is clear that all the K samples belong to branches in the R1a1a* tree with nodes very close to each other.  The only uncertainty is that there are likely many other samples that belong in other branches just as close to K.

           P type is concentrated in Poland, rare elsewhere.  N type seems to be mostly Slavic, widespread in eastern Europe.  K type corresponds to one of the two largest R1a1 clusters.  Another large R1a1a cluster, the one I call L type, is not common in Poland.

           The subtypes of K are A, B, D, and I.  Type C is a rare small type in R1a1a* distinct from K.

           Thanks go to Lawrence Mayka, Polish Project administrator, for extensive email information and assistance.

           You can compare data to my types by clicking this link to instructions for Ysearch.

           Reminder:  I am concentrating on Poland.  The statistics of STR clusters depend a lot on the database.  For example, P type stands out dramatically in Polish data.  In other countries P type is rare.  If you belong to an R1a1 cluster that is rare in Poland, I’m sorry, but I’m not covering you.  K type is an example of a type that is common both in Poland and elsewhere.  M type is common in northwest Europe but so far absent in the Polish Project.

 

Introduction

 

           This Introduction is for people unfamiliar with the jargon of genetic genealogy.

           There are quite a few web sites with a general introduction to the subject of genetic genealogy, for example Wikipedia, FTDNA, and Genographic.  Back issues of JoGG are good general references.  The Y Chromosome Wikipedia article is about male line DNA, also called Y-DNA.

           The following several paragraphs are a brief introduction to genetic genealogy for Y-DNA, providing some definitions of jargon needed to read my web pages.  The definition words are boldface.  I often use links to those definitions when I use a jargon word for the first time in a topic.  There are more boldface definitions in the summary of my Methods.

           The Y chromosome gets passed from father to son, so it works just like a male family name.  Men are divided into haplogroups based on known rare mutations (called SNP) in the Y chromosome.  Division into haplogroups is done in a manner that has virtually 100% confidence.  I say “virtually” because your confidence in your DNA result from your DNA testing company might be 98% or 99% or 99.9%;  the confidence for haplogroups is better than that.  We can be virtually certain that all the men in a haplogroup descend in direct male lines from one man, called the “Most Recent Common Ancestor” (MRCA) for that haplogroup.   Time of the Most Recent Common Ancestor (TMRCA) is an estimate of how long ago he lived - the age of the haplogroup.  Lots of people are working hard to discover more SNPs on the Y chromosome so that the haplogroups can be divided further into smaller haplogroups.  I’m doing some work on this, but I’m not discussing it in this web document.

           Other people, like me in this document, try to “stay ahead” of the haplogroups by analyzing other mutations that are not so rare (called STR) on the Y chromosome.  Men submit their Y-DNA data to various web sites.  There is lots of STR data available on the web.  Men are divided into STR clusters as hypothetical subdivisions of the haplogroups.  All such clusters are hypothetical.  Some will be validated in the future by new SNP discoveries.  There are various statistical methods for estimating the confidence of STR clusters.  I recently published a method that I developed.  That publication has references to other methods.  There is a brief summary of my method below.

           A few STR clusters are small family clusters, with the same family name.  Y-DNA is biologically accurate, so some men discover that their Y-DNA does not match the DNA of their male line cousins identified by genealogy research, due to secret adoptions, illegitimacies, etc.  This is one of the reasons some people prefer to avoid genetic genealogy.  The male line associated with the Y-chromosome is only one ancestral line.  Humans have 23 pairs of chromosomes.  Anyone who tries to make a family tree going back 300 years has more than a thousand root tips to be filled by names of ancestors who lived back then;  the one man at the tip of the male line root is only one of those thousand.  That is another reason some genealogists avoid Y-DNA genetic genealogy - the emphasis on only one line of descent out of many.  That said, many people enjoy the challenging hobby of figuring out to which ancient extended male line they belong.
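
           The "more than a thousand root tips" figure is simple doubling arithmetic.  The sketch below assumes 30 years per generation, which is an assumption for illustration, not a number from this document;  the function name is hypothetical.

```python
def ancestor_slots(years_back: int, years_per_generation: int = 30) -> int:
    """Number of ancestor positions in the oldest generation of a family tree.

    Each generation back doubles the number of positions (two parents each).
    The 30-year generation length is an assumed value for illustration.
    """
    generations = years_back // years_per_generation
    return 2 ** generations


print(ancestor_slots(300))  # 10 generations back
```

           With 10 generations there are 1024 positions, and the direct male line fills exactly one of them.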

           Most STR based clusters have an MRCA who lived thousands of years ago, before family names were common, so most men assigned to a typical cluster do not have the same family name.

           Most SNP based haplogroups have an MRCA who lived more than ten thousand years ago, so these span multiple ethnic groups and nationalities.  For example, the R1a haplogroup is of interest to me.  R1a is most common in Slavic countries but calling R1a Slavic is misleading because it is found throughout Europe and west Asia.  The MRCA lived so long ago that he may have spoken a language that we would not consider Slavic if we could hear it.  It is possible that he did not even live in what is now the Slavic region of Europe;  maybe his descendants moved there in a massive migration from the Asian steppes, or from India.  No one knows for sure.  Even if he was proto-Slavic in language and culture, by now some of his descendants long ago moved to other parts of Europe and Asia.  One of the appeals of genetic genealogy is trying to figure out ethnic descent and migration from the statistics of haplogroups.  Some people object, pointing out that ethnicity cannot be defined genetically because of all the moving and mixing of people over the millennia, and because the Y chromosome is only one of many.  True enough.  Some individuals and some web sites go too far with genetic claims.  That said, statistical analysis of haplogroup data provides many clues on human origins.

           Again, some people try to stay ahead of haplogroups, using statistical analysis of STR based clusters to gain insight into more recent human origins.  I am one of those people.  My interest is Polish origins.  This web document, however, is not for the historical analysis and conclusions, except for occasional comments to remind us of the goal.  This document is dedicated to STR analysis results, identifying clusters concentrated in Poland, with detailed explanations.

           The bottom of my Method section has more definitions for a number of genetic genealogy terms.

           There are a number of organizations and commercial companies on the web where you can order a cheek swab kit to mail in for genetic genealogy analysis, for example FTDNA.  I am not associated with the company FTDNA;  I mention them because I make extensive use of their data;  check Google for competitors.  At FTDNA, click on Products for cheek swab kits.  DNA results are confidential unless you register the data at a database;  at FTDNA, click on Projects to register your data into one of the many databases;  for example, most of my analysis is from the data in the FTDNA Polish Project.

           I use the FTDNA standard set of 67 STR markers (plus a few non-standard ones occasionally).  I do some analysis using the standard FTDNA 12, 25, or 37 STR marker sets.  Other companies use standard marker sets that may not overlap with all the FTDNA markers.

           Ysearch is the largest web database for Y-DNA, run by FTDNA, open to all men, including men who also register with projects and including men with data from other testing services.  I use Ysearch often for analysis so of course I encourage you to register your Y-DNA data at Ysearch.  From the FTDNA site, you can register your data with Ysearch.  Or you can type your Y-STR data into Ysearch.

 

Now For The Details

 

           Up to here, I have tried to write this web page as news and summary, with links to more discussion below.  I hope anyone having minimal familiarity with genetic genealogy jargon has understood.  From here on it gets more detailed.  I’m sorry about that, but the audience from now on is readers with genetic genealogy experience who want to know how I came to my conclusions.  If you cannot follow the remainder due to jargon, it is written in a manner that you can jump around and pick out what you do understand, then come back after you have read more about genetic genealogy.

           If you open this html document with Word, all the link targets (bookmarks) can be viewed alphabetically or by location.

 

Results Table

 

           Polish Project Assignments at 67 Markers, taken as representative of ethnic Polish.

           Click here for instructions for comparing your sample to the Ysearch links.

           Click on the link in the last column to jump down to more discussion for that type.

 

Cluster  Group    Type  Subtype  Subcluster  Samples  Polish %  Ysearch  Link
P                                            47       8.9%
         R1a1a7   P                          41       7.8%      8U92G    P
         R1a                     PB          6        1.1%               PB
N                                            46       8.7%
         R1a1a7   N                          32       6.1%      3SEJK    N
         R1a                     NB          14       2.7%               NB
K        R1a1a*   K                          61       11.5%     MN8R3    K
         R1a1a*         K*                   28       5.3%
         R1a1a*         A                    9        1.7%      FCUFG    A
         R1a1a*         B                    6        1.1%               B
         R1a1a*         D                    7        1.3%               D
         R1a1a*         I                    11       2.1%      EKVHX    I
R                                            69       13.1%
         R1a1a*                  KB          46       8.7%               KB
         R1a1a*   G                          ?                  ZD29Z    G
         R1a1a*   C                          1        0.2%               C
         R1a1a*   L                          ?                           L
         R1a1a*         M                    0        0%        24MB4    M
         R1a                     R-          5        0.9%               R-
         R1a                     U67         17       3.2%               U67
Totals                                       225      42.2%

 

           Those Types and Subtypes are my own code letters, for brevity.  Please do not confuse these code letters with official haplogroups.  I have been using such code letters for R1a assignments in the Polish Project for 2 years.  The color coding is for ease of comparison on my web pages.

           This table is from analysis of the Polish Project, using the data with 67 markers.  The percentage results are for data as of 17 Dec 2009, with assignments updated on 9 Jan.  My R1a worldwide document is www.gwozdz.org/R1a.html, including a list of R1a types, along with percentages worldwide, which are not the same as percentages in Poland.

           The 3 types P, N, and K are hypothetical clades of R1a1 Y-DNA.  Insofar as the Polish Project represents Polish male line ancestry, the % column is my best estimate of the frequency of these types in Polish male ancestry.  Each individual assignment to a type or subtype has at least 80% confidence;  the overall table is probably about 95% accurate.

           R cluster is the Remainder, samples (men) who do not belong to P, N, or K types.

           P and N types are R1a1a7 and everything else so far is R1a1a*, although rare exceptions are expected.

           P Borderline and N Borderline (PB and NB) clusters:  These are the samples that will benefit most from the M458 test.  I’m not sure, but I suppose these might be about 70% M458+ (R1a1a7) and 30% M458- (R1a1a*), so I labeled them as group R1a.  I counted the PB as part of the P “cluster” but I distinguish the P type samples that are M458+ verified or at least 80% confidence of being M458+.  Similarly for NB.  In other words, a few of the PB and NB samples are probably not P or N type.  Many of the PB and NB samples were KB before the M458 test came along;  I moved them here in order to emphasize which samples can benefit most from the M458 test.

           K Borderline (KB):  These are samples that have low step compared to the K type modal haplotype, but are not close enough to individually have 80% confidence of belonging to K type.  On the other hand, these do not individually have 80% confidence of not belonging to K type.  A few of these no doubt belong to K type, but I estimate about 90% do not, so I put KB in R.  I keep them separate because the rest of R is samples with at least 80% confidence of not belonging to P, N, or K.  The K modal haplotype is essentially the modal haplotype for R1a as a whole, so the R1a tree is very densely populated near K in haplospace.  Some of the KB cluster probably represents small clades that are close to K, with nodes in the tree somewhat older or somewhat younger than the definition node for K.  Actually, B and D types are like this;  for each, I estimate about 70% probability that they have nodes younger than K, qualifying them as subtypes of K;  I estimate about 20% probability they have a node just slightly older than K;  I estimate (actually an educated guess) about 10% probability they are very distant from K with STR values that are similar by coincidence due to the STR values of the MRCA.  So the percent for K in the table above represents a good estimate for the size of K in Poland - a few KB should be added to K, but perhaps one of the subtypes should be subtracted.  However, I emphasize that the individual assignments to these subtypes are high confidence regardless of the exact location of the node for the subtype in the R1a tree.  There is no clear STR border between K, KB, and R;  I estimated the K step value to distinguish them.  I anticipate that many KB clades will be difficult to define with STR values, so many of these may need to wait for SNPs to distinguish them.

           R is the remainder cluster category, including some small types, that are distant from P, N, and K.  Samples with at least 80% confidence of not belonging to any of my defined types go into the R- and U67 subclusters.  R- are the samples that have been tested M458-.  The concept of R is very similar to the concept of R1a*;  if we imagine that someone discovers an SNP that defines the node of K type, and considering that M458+ defines P plus N types, then the R cluster would be a type of R1a* - a hypothetical paragroup of R1a that does not include P, N, or K types.  So far no sample in the Polish Project has been identified from the rare R1a groups;  such a sample would probably have unusual STR values and thereby fall into R- or U67, so I used “R1a” in the table.

           U67 is a temporary category that was set up when the M458 test became available.  These are R samples that have not been tested for M458.  Ordinarily I do not use U for samples with all 67 markers, but in view of the new test these were highlighted U67 to emphasize that they could gain information by trying the new M458 test.  There is still a chance a few of these might come out M458+, representing a new clade.  We now have enough M458 data that I guess each sample has about 70% confidence of being M458-;  in a few months, if none of these turn up M458+, I’ll move these into R-.

           I’ll keep the R category even as other clades within R become confident, because it is a good overview to divide Polish R1a into 4 clusters of about equal size.

           M is a subtype of a larger L type.  M and L are common in northwest Europe, but rare in Poland.

           Types N, A, and K do not seem concentrated in Poland but are found mostly in Slavic countries.

           P, D, and I types are concentrated in Poland.

           Click on the links to jump down to more discussion of types.

           The Ysearch links provide the full modal haplotypes, using the indicated number of STR markers, out of the standard FTDNA set of 67 markers.  I entered this data into Ysearch for our convenience.  All my modal haplotype definitions are available in the Excel file Haplotypes.xls, which also has experimental types not mentioned here.  Below are Ysearch instructions for quickly comparing your haplotype to all my types at once.

           G type is experimental, with about 70% confidence at this time.  Pc and Pg are hypothetical subtypes of P type, with less than 80% confidence.  No assignments yet, but confidence of small types is dominated by sampling statistics, which improves rapidly (or fails rapidly) as data accumulates.

           Assignment to types is with 80% net confidence (confidence that the type is valid multiplied by confidence that the sample belongs to the type).  For subtypes of K, confidence that the subtype is indeed a subtype of K is not included in the net confidence.  In other words, samples likely belong to the subtype even if the subtype is placed wrong.  These subtypes very likely have nodes close to the K node, however.
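
           The net confidence product can be written out as a short sketch.  The function names and the example numbers for individual samples are hypothetical;  only the 80% threshold and the multiplication come from the paragraph above.

```python
THRESHOLD = 0.80  # minimum net confidence for an assignment, from the text


def net_confidence(type_valid: float, sample_in_type: float) -> float:
    """Confidence that the type is valid times confidence the sample belongs."""
    for value in (type_valid, sample_in_type):
        if not 0.0 <= value <= 1.0:
            raise ValueError("confidences must be between 0 and 1")
    return type_valid * sample_in_type


def is_assignable(type_valid: float, sample_in_type: float) -> bool:
    """True if the net confidence reaches the 80% threshold."""
    return net_confidence(type_valid, sample_in_type) >= THRESHOLD


# Hypothetical sample: type confidence 99%, sample membership 85%
print(round(net_confidence(0.99, 0.85), 3), is_assignable(0.99, 0.85))
# Hypothetical sample: type confidence 85%, sample membership 90%
print(round(net_confidence(0.85, 0.90), 3), is_assignable(0.85, 0.90))
```

           The second example shows why a type with lower overall confidence needs a closer match before an individual sample reaches 80% net confidence.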

           The estimated percentage for P, N, K, and R in the Results Table add up to 42.2%, which is the percent of R1a in the Polish Project.

 

My Mountain Method

 

           Here is a summary of terms (in boldface) that I defined for my “Mountains in Haplospace” method.  For more explanation, see the fall issue of JoGG.  By haplospace I mean multidimensional sets of STR values;  each haplotype is a point in haplospace.

           A cluster qualifies as a type if the graph of step frequency (number of samples at that step) vs step looks like an isolated mountain.  The step is the genetic distance (mutation count) from the modal haplotype of the cluster.  I use the method of Ysearch to calculate step.  The cutoff is the next step just beyond the mountain.  A good type has low step frequency in a “gap” of step values including the cutoff (only the cutoff for a gap of 1).  In other words, the cluster forms a mountain at step values less than the cutoff, separated by a gap from the rest of the database (the parent haplogroup usually) at higher step numbers.
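
           The step frequency graph can be illustrated with toy data.  In this sketch the marker values and samples are made up, the function names are hypothetical, and the step is a simplified sum of absolute differences;  the Ysearch method may treat some markers differently, and that detail is ignored here.

```python
from collections import Counter

# Hypothetical modal haplotype and samples; the marker values are made up.
MODAL = {"DYS393": 13, "DYS390": 25, "DYS19": 16, "DYS391": 10}
SAMPLES = [
    dict(MODAL),                                                        # step 0
    {**MODAL, "DYS390": 24},                                            # step 1
    {**MODAL, "DYS391": 11},                                            # step 1
    {**MODAL, "DYS19": 15, "DYS391": 11},                               # step 2
    {**MODAL, "DYS393": 14, "DYS390": 23, "DYS19": 17, "DYS391": 11},   # step 5
    {**MODAL, "DYS390": 22, "DYS19": 14},                               # step 5
]


def step(sample, modal):
    """Simplified genetic distance: sum of absolute marker differences."""
    return sum(abs(sample[m] - modal[m]) for m in modal if m in sample)


def step_frequency(samples, modal):
    """Count how many samples sit at each step from the modal haplotype."""
    return Counter(step(s, modal) for s in samples)


def gap_is_low(freq, cutoff, gap_width=1, max_count=0):
    """True if every step in the gap (starting at the cutoff) has few samples."""
    return all(freq.get(cutoff + i, 0) <= max_count for i in range(gap_width))


FREQ = step_frequency(SAMPLES, MODAL)
for s in range(max(FREQ) + 1):
    print(f"step {s}: {'#' * FREQ.get(s, 0)}")
print("gap of 2 at cutoff 3:", gap_is_low(FREQ, cutoff=3, gap_width=2))
```

           In this toy data the "mountain" sits at steps 0 to 2, the cutoff is 3, and steps 3 and 4 form an empty gap before the background samples at step 5.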

           The Statistical Background Percent (SBP) is an objective measure of the quality of the type.  Low SBP is taken as evidence that a type corresponds to a clade that may be verified as a haplogroup in the future by an SNP (yet to be discovered).  Larger types with lower gaps have lower SBP.  SBP is intended as an estimate of the background percent of samples in a type that really do not belong to the corresponding hypothetical clade.  SBP is increased to account for the estimated probability of outliers from other clades.  An outlier is a sample that has very unusual STR values due to the luck of mutations.  SBP is also increased to account for the estimated probability of small foreign clades that just happen to have the same STR values but are not closely related to the type.  The SBP is also increased to provide the rough equivalent of the maximum in a confidence interval.  Small sample counts have wide confidence intervals.  So larger types (more samples) automatically get lower SBP.  For a valid clade, SBP should decrease with time as data accumulates in a database.  A very well isolated clade will have a low SBP even with only a few samples.  SBP < 5% is very rare - a very well isolated type, very likely to be a clade.  SBP < 25% is good enough to be published.  SBP < 50% is a type worth watching as data accumulates with time.  The SBP equation (available as an Excel worksheet in the tools) produces SBP > 100% for clusters that do not look like mountains.  The number of markers in the definition should be chosen to provide as small an SBP as possible;  my Excel tools provide automatic rank of markers as an aid;  human judgment can be used to include or exclude markers with obvious problems.  A signature is a small set of markers that rank best, convenient for publication of a type, and for simple demonstration of the correlation of STR values.
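
           The SBP thresholds above can be summarized as a lookup.  This sketch only restates the thresholds;  it does not reproduce the SBP equation itself, which is in the Excel tools.  The function name is hypothetical, and the text gives no label for values from 50% to 100%.

```python
def describe_sbp(sbp_percent: float) -> str:
    """Return the quality label that the text attaches to an SBP value."""
    if sbp_percent < 0:
        raise ValueError("SBP cannot be negative")
    if sbp_percent < 5:
        return "very well isolated type, very likely a clade"
    if sbp_percent < 25:
        return "good enough to be published"
    if sbp_percent < 50:
        return "worth watching as data accumulates"
    if sbp_percent > 100:
        return "cluster does not look like a mountain"
    return "no label given in the text"


for value in (3, 20, 40, 75, 120):
    print(f"SBP {value}%: {describe_sbp(value)}")
```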

           I use the word “type” to mean 1) the hypothetical clade, and 2) the associated cluster of data, and 3) the modal haplotype, and 4) all possible haplotypes that differ from the modal haplotype by step less than the cutoff.  The definition of a type is the modal haplotype plus cutoff.  The definition uses only those STR markers that provide the lowest SBP, but the definition uses as many STR markers as possible.  The definition of a valid type may change slightly as data accumulates.

           Here are some common terms (in boldface) for genetic genealogy.  I did not define these, although I use them in a restricted sense:  A marker (also “locus”, plural loci) is a DNA location for an SNP or STR or other kind of mutation.  A haplotype is a set of gene values at any number of markers, here restricted to Y-DNA STR values.  I use the word sample (plural samples or data or database) for the Y-DNA STR values from one man.  A sample is also commonly called a haplotype, but I avoid calling a sample a haplotype to make it clear that a haplotype may or may not be present in a particular database of samples.  A clade is a general term for common descent, so an SNP haplogroup is one kind of clade.  I use the word clade in general, when meaning a Y-DNA clade that may or may not be a defined official haplogroup.  All types have associated hypothetical clades, but most clades cannot be isolated as types with low SBP.  A cluster is a set of samples with similar STR values.  All types have associated clusters but not all clusters are associated with types.  The modal value for a marker is the most common value in the cluster.  The modal haplotype is the set of most common values, usually the most common haplotype in a cluster.  Many people use the adjective “modal” as a noun, meaning “modal haplotype”;  so do I, although I tried to avoid that in this web document.
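
           As a small illustration of the modal value and modal haplotype, here is a Python sketch.  The three samples are invented, and the function name is mine.  The example is chosen so that the modal haplotype is not itself one of the samples, which is the reason I keep the words “sample” and “haplotype” separate.

```python
from collections import Counter

# Hypothetical cluster of three samples; the values are for illustration only.
cluster = [
    {"DYS393": 13, "DYS390": 25, "DYS19": 15},
    {"DYS393": 13, "DYS390": 24, "DYS19": 16},
    {"DYS393": 14, "DYS390": 25, "DYS19": 16},
]


def modal_haplotype(samples):
    """Most common value at each marker (ties go to the value seen first)."""
    markers = samples[0].keys()
    return {m: Counter(s[m] for s in samples).most_common(1)[0][0] for m in markers}


modal = modal_haplotype(cluster)
print(modal)
print(modal in cluster)  # is the modal haplotype present as an actual sample?
```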

           Not all Y-DNA STR data separates into types because the distribution of STR values tends to be continuous.  A type corresponds to a clade that experienced a population bottleneck - isolation or migration or very rapid population growth.

 

Assignment Rules

 

           For the Polish Project “Y-DNA Results” Page.

           See the “PolishProjectRules” sheet in the Excel File with Assignments.

           Genetic Distance assignment rules for the Polish Project.

           Genetic Distance is calculated following the method of Ysearch.

           These rules represent 80% or better confidence.

           See the Confidence topic for statistical explanation.

 

Technical Discussion of the Table of Assignment Rules: 

           The P type modal haplotype that I entered into Ysearch uses 36 of the 67 standard markers available through FTDNA, so it is called P36.  Of these 36, 17 are in the standard 37 set, so the same haplotype is called P17(37).  Similarly, P15(25) is for data with 25 markers.  Similarly, the N type modal haplotype uses 45 of the standard 67, etc. for the other types.

           If you have some but not all the required data on Ysearch (for example 37 markers plus a panel of 12 more, or for example ancestry.com data with all 37 plus some but not all of the additional markers) then that table of assignment rules will not work properly, because those other markers may increase the genetic distance.

           These assignment rules must be applied in order, top to bottom, within each of the 3 sections.  The A and I samples also match the K rule, because they are subtypes, but they are assigned to A and I because those rules come first.  If you are assigned to A or I, please understand you are also considered a K.  There are no Borderline A or I because all but one of those so far are solid K, and that one exception is K Borderline.
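
           The effect of applying rules in order can be sketched as a first match wins loop.  The order A, I, K follows the paragraph above;  the cutoff numbers and the sample distances are placeholders that I made up for the sketch, not the values in the “PolishProjectRules” sheet.

```python
# Ordered rules: (type name, maximum genetic distance allowed).
# Placeholder cutoffs, not the real Polish Project values.
RULES = [("A", 3), ("I", 3), ("K", 3)]


def assign(distances, rules=RULES):
    """Return the first type whose rule the sample satisfies, else None."""
    for name, max_distance in rules:
        if distances.get(name, float("inf")) <= max_distance:
            return name
    return None


# Genetic distances of hypothetical samples from each modal haplotype.
print(assign({"A": 2, "I": 9, "K": 3}))  # A, although the K rule also matches
print(assign({"A": 8, "I": 7, "K": 1}))  # K
print(assign({"A": 8, "I": 7, "K": 6}))  # None
```

           Putting the subtype rules ahead of the K rule is what keeps an A sample from being reported as plain K.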

           The Borderline categories are not clades.  The 80% confidence rule does not apply.  For example, a K Borderline sample is one that matches no type with 80% confidence, but matches K at more than 20% confidence.  For more explanation, see the M458 detailed discussion.  See also the “Discussion” sheet in the Excel File with Assignments.

           In my article about Polish Clades, I provide a cutoff definition for each type.  More definitions are available in Haplotypes.xls.  Those definitions are wider than the Assignment Rules.  For example, P type is defined as samples that match P36 at a genetic distance less than 5, vs. 4 in the Assignment Rules.  Reason:  samples at the edge of the definition are more likely to be “background” samples that really do not belong to the clade, but just happen to have haplotypes that come close to matching, due to random mutations.  I figure samples at genetic distance 4 from P36 each have less than 80% probability of belonging to P type.  They are included with probability adjustment in the estimated size of P type.  On the other hand, those samples at 4 do not meet the 80% guideline for assignment on the Polish Project web page.  A similar discussion applies to the other Assignment Rules, when compared to my article submitted for publication.

           The Excel File with Assignments has best estimate assignments for all members of the Polish Project, some at less than 50% confidence.  That file is updated from time to time, as new data accumulates in the Polish Project.  That file also has tentative assignments to speculative types not used for the Polish Project.

           Here is a breakdown of the Polish Project, 20 Sep 2009:  I saved this from a previous version of this page as an example.  It’s very tedious to update, but it’s representative.

           1043 members

                       441 R1a1

                                  192 have 67 markers

                                             92 have 80% confidence assignments to P, N, or K

                                                        34 P

                                                        20 N

                                                        38 K including 7 A and 7 I (no B type on 20 Sep)

                                             17 have 80% confidence assignments to R (confident not PNK)

                                             62 are K Borderline

                                             23 are borderline P, PK, N, or NK

                                  127 have 37 markers

                                             39 have 80% confidence in assignments

                                                        17 P

                                                          2 N

                                                        18 K including 5 A

                                                          2 R

                                             88 are classified U, Unassigned

                                   29 have 25 markers

                                             3 have 80% assignments;  1 A and 2 P

                                             26 are U

                                  91 have 12 markers;  all are classified U

           As more data accumulates on the web, assignment rules will change, because more data means better statistics.  No doubt other types and subtypes will be proposed here.  I’ll probably refine my statistical methods, leading to minor changes.  I expect SNP mutations to be discovered in the future, providing true subgroups of R1a1, with different code names than the P, N, K codes I am using.  My types will be either validated or replaced.  I expect STR data will “stay ahead” of SNP data, so people will still be interested in proposed subdivisions based on STR data.

           The recent M458 SNP is one such new mutation, announced on 4 November.  My previous P type and N type assignments all have been coming out M458+.

           My % confidence for a sample includes both accuracy of assignment and also the probability that I am wrong about that particular type or subtype.  For example, P type is 98% confident.  If a sample is 85% statistically confident of fitting the P clade based on STR haplotype similarity, my net confidence is 98% times 85% = 83.3%.
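
           The arithmetic of that example, written out as a tiny Python sketch (the function name is mine):

```python
def net_confidence(type_validity, sample_fit):
    """Multiply the confidence that the type is valid by the confidence
    that the sample fits the type; both are probabilities from 0 to 1."""
    return type_validity * sample_fit


print(f"{net_confidence(0.98, 0.85):.1%}")  # 83.3%
```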

           The % confidence is a combination of calculations and educated estimates.  The calculations are fully explained in the article submitted for publication, which will soon be linked here.  The estimates are knowledgeable, based upon statistics too complex to calculate and based upon limited data.  The estimates are listed in those articles, but discussed only briefly.  As one example of an estimate:  it is possible that P type is a combination of two about equal sized clades that are distantly related - more distantly than the nodes for P, N, and K - but by coincidence both clades just happen to have similar haplotypes due to the luck of random mutations.  There is no way to calculate such a probability, so I just take such things into account with knowledgeable estimating.  That’s the basis of my 98% estimate for P type validity.  Bottom line:  the % confidence numbers for assignment of samples are not fully explained in the articles submitted for publication.  Please do not misconstrue my probability numbers as fully calculated;  they are just a method to communicate my combination of calculations plus estimates.

           Notice that the K Borderline category is the largest category in the 67 marker data.  That is because the K modal haplotype is almost the same as the modal haplotype for R1a1 as a whole.  It makes sense that many of the unknown subclades in R1a1 have modal haplotypes very similar to the R1a1 modal haplotype, which is similar to K, so of course many samples come close to fitting K without coming close enough to be judged part of K using my “mountain” method.

           Reminder:  use of the assignment rules for Y-DNA known to not originate from Poland may be misleading.

 

R1a

 

           See R1a Subdivision for the current SNP breakout.  Almost all of R1a are R1a1a7 (M458+) and R1a1a* (M458-).  Lawrence Mayka, the administrator of the FamilyTreeDNA Polish Project, assures me by email that all the Polish Project member tests within R1a have been coming out negative for all the rare SNP subgroups.  So if you are a Polish R1a, you are almost surely R1a1a, the same haplogroup as about half the men from Poland.  About half of these - about 1/4 of men from Poland - are R1a1a7.

 

Description of the Types

 

           Click the Ysearch web links in the Results Table for modal haplotypes, which are my best fits of web data to groups of men with similar STR data.

 

           A.  Ashkenazi.  This seems to be a subtype of K.  This type is discussed in my publication, Part II.  I have about 90% confidence in that subtype status, but I am more than 95% certain that A is a valid clade, not just because of my work, but because the modal haplotype closely matches the various versions of the most common Ashkenazi haplotype, which has been widely studied and reported on the web.  It should be emphasized that not all Ashkenazi match this cluster, and some men in this cluster may not be descended from Ashkenazi.  This type is not restricted to Poland.  Levy-Coffman wrote an article about Ashkenazi genetic genealogy, discussing the hypothesis that Ashkenazi descend from the kingdom of the Khazars, who lived in what is now the Ukraine.

 

           B.  Another subtype of K, recently identified, just now being analyzed and documented.  Concentrated in Poland.  The B data cluster lies at the edge of the K cluster.  The node for B type in the R1a tree might be slightly younger or slightly older than the K definition node.  I estimate the former at about 80% probability, that is, that B is truly a subtype of K;  if not, then it probably lies just beyond K.  Individual assignments to B type have 80% to 85% confidence.

 

           C.  Needs documentation.

 

           D.  Concentrated in Poland.  This type was added here on 9 Jan, at which time 7 Polish Project members were assigned to D type plus 9 more to the D Borderline category (STR values closest to D type but cannot be assigned with 80% confidence).  An additional 4 men who match the D Borderline criteria were not assigned because they are close matches to A and I types;  in other words, some of the D Borderline samples are probably D outliers and some might not truly belong to D.  Total 20 samples.  The cluster was brought to my attention by Mayka, who points out that Nordtvedt mentioned the cluster in web discussions some time ago, based on the very rare DYS462=12 value.  DYS462 is not one of the FTDNA standard markers;  it is a standard at Sorenson;  DYS462 is available in data on Ysearch.  I did an analysis using the 67 FTDNA markers;  the SBP came out 18.4%, providing 80% or better confidence just on that basis for the best fit samples.  However, 462 would significantly reduce SBP, so confidence in validity of a clade corresponding to D is quite high considering 462.  Only 5 of the samples that fit D type in the Polish Project have been tested for 462 and all 5 have that rare 12 value.  I notice that there seems to be a larger cluster that includes D type plus the borderline samples, but I could not get a reasonable SBP or a consistent definition;  there seems to be too much overlap in D borderline with the large K type and the large K Borderline cluster.  It will be interesting if D Borderline men who fall just beyond D type start testing for 462 in order to find the limit of a hypothetical clade corresponding to 462=12, which may be larger than D type.  D type also has the unusual DYS481=21 value;  only 10 samples in the Polish Project R1a have this value, the 7 who fit D type and 3 of the D borderline.  If those 3 others are shown to have the 462=12 value, that will establish them as solid D type.  
Unfortunately, Sorenson does not use the 481 marker, so there are only 3 R1a1 samples on Ysearch with the D type signature pair (462,481) = (12,21);  all 3 are Polish Project members among those assigned on 9 Jan 2009 to D type.  (There are 2 others on Ysearch with this very rare signature pair in other haplogroups - coincidence.)  D type is clearly a Polish type:  In the Polish Project 14 of those 20 samples indicate “Poland” ancestry;  the exceptions are 1 blank, 1 obvious Polish family name, 1 Latvia, 1 Slovakia, and 2 Prussian Poland - Germany.  Among those with 481=21, 7 of the 10 indicate “Poland”.  On Ysearch, 5 of the 7 best fits (with D step <6) indicate “Poland”, while at steps 6&7 only 1 of 9 indicates “Poland”.  That is a hint of a non-Polish clade close to the edge of D type, which might be the reason the SBP for D type on Ysearch is 22%, not as good as that 18.4% in the Polish Project.  Or maybe this is a hint of a larger parent clade that is not Polish.  D type is very young, about 1,000 years TMRCA, and seems to be composed of subtypes Da and Db (not yet statistically significant).  I classify D as a subtype of K, but see my K Borderline discussion in this regard.  For more details, see the “Documentation” sheet in my analysis file “DType.xls”, at my update folder.

 

           G. This cluster is hypothetical.  It has been discussed on the web by others.  I expect more data will soon justify a discussion here.

 

           I.  Concentrated in Poland.  This type is discussed in my publication, Part II.  About 85% confidence of validity.  About 80% net confidence that both A and I are subtypes of K.

 

           K.  This seems to be a main R1a1a type.  K type is discussed at length in my publication, Part II.  It is larger than others in the Slavic lands.  P and N (below) are just as close in STR values to K as they are to each other, probably because the K modal haplotype is the same as the R1a1 modal haplotype (using the best 34 markers for K).  So far I have discerned a few subtypes of K in my List of R1a types, but I do not have high confidence that they are all exact subtypes of K, as explained in my K Borderline discussion.  I suppose that as data accumulates more subtypes will become clear within K and K Borderline.

           In the Results I use K* to signify those samples that match type K but do not match one of the subtypes.  Although I have high overall confidence in the validity of K type, individual assignments to K* are not as confident.  Because K is located at the modal heart of R1a, I expect some outlier samples from distantly related clades to match K* fairly closely just due to the statistics of random STR mutations.  Because of the possibility of foreign outliers, I consider samples at K step 3 to be K Borderline, even though the cutoff for the K definition is 4.  Even K* samples with step <3 have confidence of only 80 to 90%.  That’s in Poland, where K is fairly well defined with SBP = 26%.  Worldwide K* cannot be discerned with confidence.  The Ysearch SBP for K is 71%, not significant.  That means there are K borderline clades close to the K cutoff that are rare in Poland but cause interference on Ysearch.  This is evident from a glance at the K type results on Ysearch, where “Poland” origin is concentrated at steps <3, and “Poland” becomes progressively less common at higher steps.  A type is a very high confidence subtype of K, so these caveats about K* do not apply to the very high confidence of individual assignments to A type, and similarly to the other subtypes.

           The Kurgans are the ones who domesticated the horse more than 6,000 years ago.  Many scientists think that one pre-Kurgan man is the male line ancestor of all R1a1 men who live today.  The Kurgan hypothesis is controversial, and not necessary for this web page.  You may have noticed that I used the letters of “Kurgan” for my original types and categories during 2008.

 

           L.  This cluster is highly hypothetical.  It is rare in Poland, but second in size to K in European R1a1.  Larry Mayka suggested this cluster to me.  It is a well known Scandinavian cluster.  I checked it briefly, and it seems to be a “type” by my definition.  However, no Polish Project samples match at 80% confidence yet, so I am not yet using it for classification here.  More documentation about L will be available here when I find time to study it.

 

           M.  Needs documentation.

 

           N.  Concentrated in Slavic countries.  This type is discussed in my publication, Part II.  This is a type that according to Yhrd seems to be spread all around the Slavic lands and central Europe, from East Germany to Russia.  N has more mutations than P, so that means it is older.  Within Poland N seems to be slightly smaller than P, but overall N is larger than P.  Previous versions of this page had Na and Nb as speculative subtypes, but I removed those because it seems N type should be properly studied in a database that is not restricted to Poland.  However, I’ll continue to watch the Polish Project, because it will be interesting if more data provides a Polish subtype within N.

           There are web comments about a new R1a1 SNP, to be announced shortly.  My guess is that this new SNP might correspond to the cluster of data associated with what I call N type.

 

           P.  Concentrated in Poland.  This type is discussed at length in my publication, Part II.  It seems that about 8% of Polish male line ancestry men belong to this type.  According to Pawlowski, this cluster is concentrated in Poland.  I verify Polish types using both Yhrd and Ysearch.  P has fewer mutations than N and K, so it must be younger.  My TMRCA age assessment is 1600 years, but in light of age caveats P type might be 1 to 3 thousand years old.  Regardless of age, P type seems to have had a population expansion less than 1 thousand years ago.  My publication provides details on the size and age calculations along with evidence regarding the validity of P type.  In my R1a web document, I used P type as an example for a discussion of the caveats associated with TMRCA calculations, and also as an example to explain the possibility of hidden clades, and also as an example for population bias in databases such as Ysearch, so you can find lots more discussion about P type by clicking on those links.  I identified P type and submitted my analysis for publication before the M458 mutation was announced by Underhill.

 

           Pc & Pg.  These subclusters have about 70% confidence, so no assignments yet.  Previous versions of this web page used Pa & Pb & Pe.  The new versions, Pc & Pg, are different, so they got a different subscript letter, although I have modified the same Ysearch IDs.  I have a Pd and other subtypes that are too speculative to mention at this time.

 

           U.  Unassigned.  This is not a cluster, but a holding place for samples with less than 80% confidence in assignment.  U is used in the Polish Project for uncertain samples with less than 67 markers.  Samples with all 67 standard markers are not assigned to U, but instead are assigned into “Borderline” categories, P Borderline, PK Borderline, N Borderline, and NK Borderline.

 

Instructions for Use of Ysearch

 

           Link to the site:  http://www.ysearch.org.  Brief description of Ysearch.

           Click on the Create A New User tab, where you can upload your Y-DNA STR data from a number of testing services.  Or, you can type in your data.  You end up with a “User ID”.

           Ysearch has a Research Tools tab to click, where you can type in other User ID’s for comparison.

Cluster Genetic Distance Method; for:  P - Pc - Pg - N - K - A - I - M - G: 

           Click here:  Research Tools

           Copy the following line into the “UserIDs” bar at the Research Tools page:

        USEID, 8U92G, RQK32, 92HEK, 3SEJK, MN8R3, FCUFG, EKVHX, 24MB4, ZD29Z

           Change USEID to your User ID.

           You need to type the Captcha puzzle for access.

           Click on “Show genetic distance report”.  You get a table of results.

           Result:  If there is a small genetic distance result (3 or less) for one of these types, you have a high probability of belonging to that type.  More detailed rules are available in the Assignment Rules above, which are followed by several paragraphs of discussion.
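
           If you prefer not to edit the line by hand, the “UserIDs” line can be built with a few lines of Python.  The nine type IDs are the ones listed above;  “ABCDE” is a made-up User ID standing in for your own, and the function name is mine.

```python
# The nine modal haplotype IDs from the line above, in the same order.
TYPE_IDS = ["8U92G", "RQK32", "92HEK", "3SEJK", "MN8R3",
            "FCUFG", "EKVHX", "24MB4", "ZD29Z"]


def user_ids_line(my_id, type_ids=TYPE_IDS):
    """Return the comma separated line with my_id in place of USEID."""
    return ", ".join([my_id.strip()] + list(type_ids))


print(user_ids_line("ABCDE"))  # ABCDE is a placeholder, not a real User ID
```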

           Reminder:  this web page is for men with R1a1a type Y-DNA.  If you are not R1a1a, these instructions will not produce a matching result, except very rarely, in which case the result would be meaningless.

           The emphasis is on men of Polish male line ancestry.  Just about all R1a Polish line men are R1a1a.  Anyone from the haplogroup R1a1a from other countries may get good results, but that may be misleading if there are other types, rare in Poland, not noticed by me, but with haplotypes that overlap one of these 9 types.  Many men of Polish male line ancestry do not match any of these types.  For non-Polish there is a higher probability of not matching any of these types.

 

37 Marker Network

 

           Lawrence Mayka (independently, March 2007) constructed a “median joining network” Network for the 37 marker samples of the Polish Project.  This network supports the definitions of the P & N clusters, and of the A subcluster.  The P cluster is the left side of Mayka’s network;  N is the top branch, and A is a small branch on the lower right.

 

Confidence Percent

 

           This topic explains how I figure percent confidence for assignments of individual samples (men) in the Polish Project.  My publication explains my statistical methods.  There is a summary of my mountain method above.

           Confidence interval example:  By 80% confidence I mean 80% is the lower number of the 80% confidence interval.  For example, 80% confidence might mean that the actual probability is 90% but the 80% confidence interval is 80% to 96%.  As an example, consider a situation where 10 samples match a type with an STR test.  Suppose there is a definitive SNP test available, and 9 of those 10 samples test positive for the SNP, and 1 negative.  That means 9 of the 10 really belong to the haplogroup and that 1 mismatch must come from a different haplogroup that matched the STRs by the luck of mutations.  Next, consider a new sample that matches that same STR test.  What is the confidence that the new sample will pass the SNP test for the haplogroup?  The probability is 90% because we know that 9 out of 10 previous samples like this matched the SNP.  However, 1 out of 10 is a very small sample.  As explained in my publication, I use Poisson statistics for quick calculation of confidence interval.  Poisson statistics is simple to calculate in Excel.  My tool Type.xls has an “SBP” sheet with a set of cells for quick Poisson calculations.

           80% confidence interval of 1 is 0.11 to 3.89, which is 1.1% to 38.9% out of 10, so subtracting from 100%, the 80% confidence interval of a match comparing to 9 out of 10 is 61.1% to 98.9%;  that lower number 61.1% means the 80% confidence ranges to lower than 80%, so net confidence is lower than 80%.

           70% confidence interval of 1 is 0.16 to 3.37, which is 1.6% to 33.7%, lower number 66.3%;  net confidence lower than 70%.

           60% confidence interval of 1 is 0.22 to 2.99, lower number 70.1%; confidence higher than 60%.

           67.3% confidence interval of 1 is 0.18 to 3.26, lower number 67.4%.  So that’s my one number:  67% confidence.

           In other words, if 9 out of 10 samples that match an STR also match the SNP test, we have 67% confidence a particular future sample matching the STR test will also match the SNP test.

           For 18 out of 20, the probability is still 90%, but a similar calculation shows 75% confidence.

           For 36 out of 40, the probability is still 90%, but a similar calculation shows 80% to 96% confidence interval, net 80% confidence, which is my example that I started with above.  These calculations actually take less than a minute using my Excel cells.
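
           For readers without Excel, the calculations above can be approximated in Python.  This is a sketch, not a copy of the cells in Type.xls:  it assumes the usual “exact” Poisson interval with the leftover probability split equally between the two tails, which reproduces the 0.11 to 3.89 quoted above for the 80% interval of 1.  The function names are mine.  The last function searches for the one confidence number at which the lower end of the match rate equals the confidence itself.

```python
import math


def poisson_cdf(k, mu):
    """P(X <= k) for a Poisson variable with mean mu."""
    term = total = math.exp(-mu)
    for i in range(1, k + 1):
        term *= mu / i
        total += term
    return total


def solve_mean(k, target):
    """Find mu with poisson_cdf(k, mu) == target by bisection.
    The cdf falls as mu rises, so move the lower end up while cdf > target."""
    lo, hi = 0.0, 10.0 * (k + 1) + 30.0
    for _ in range(200):
        mid = (lo + hi) / 2
        if poisson_cdf(k, mid) > target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def poisson_interval(k, confidence):
    """Two-sided interval for the mean, given an observed count k."""
    tail = (1 - confidence) / 2
    lower = 0.0 if k == 0 else solve_mean(k - 1, 1 - tail)
    upper = solve_mean(k, tail)
    return lower, upper


def one_number_confidence(misses, n):
    """Confidence c at which 1 - upper(misses, c) / n equals c itself."""
    lo, hi = 0.0, 1.0 - 1e-9
    for _ in range(100):
        c = (lo + hi) / 2
        lower_match_rate = 1 - poisson_interval(misses, c)[1] / n
        if lower_match_rate > c:
            lo = c
        else:
            hi = c
    return (lo + hi) / 2


print([round(x, 2) for x in poisson_interval(1, 0.80)])
for misses, n in [(1, 10), (2, 20), (4, 40)]:
    print(n - misses, "of", n, round(one_number_confidence(misses, n), 2))
```

           Under these assumptions the three cases should come out close to the 67%, 75%, and 80% figures given above.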

           Statistical Background Percent:  SBP.  I use SBP as a net confidence estimate for the background (samples that match the STR values but really do not belong to the clade of a type).  My publication does not go into the details of confidence intervals.  That is the purpose of the explanation here in this topic.  SBP is my estimate for the net statistical confidence before any SNP has been discovered to validate a hypothetical type.  100% minus SBP is my estimated confidence that a sample in the mountain cluster belongs to the corresponding hypothetical clade.

           A mountain cluster corresponding to a type might include outliers from other clades, or might include foreign clades.  These and other caveats associated with STR prediction are discussed in detail in my publication, where I point out that the confidence for all such caveats cannot be calculated.  I estimate the background by using the low frequency of samples in the gap as representative of the background throughout the haplospace neighborhood.  My SBP formula (available in the tools) includes an increase in SBP to account for all such caveats.

           Part I of my publication explains:  “Much of the background is probably at the last step of the mountain, just before the cutoff.  Much of the remainder is probably at the previous step, much of the remainder after that at the previous step, etc.”  My Part I Table 2 justifies this by demonstrating how the number of possible haplotypes increases very rapidly with step.  In other words, SBP is a good worst case overall estimate of background percent within a type, but background percent is very low at step zero and increases rapidly with step.  My publication does not provide a formula for background vs step and in fact I have not derived a formula.  For assignment of samples, I estimate the confidence vs step in a manner to provide a rapid decrease in confidence near the last step, in a manner to produce overall confidence roughly equal to 100% minus SBP.  Step zero is my rough estimate that the type is a valid clade, since the step zero samples belong to the clade with very high probability if the type is valid.

           Some outliers from the type statistically fall within or even beyond the gap, so confidence is not zero at the cutoff.

           Confidence also depends upon the size of the gap.  A wide gap with zero samples means even samples in the gap near the mountain have reasonable confidence percent.

           Estimates vs Calculations vs Adjustments:  My confidence vs step is a combination of calculation and educated estimates based on experience.  A person who assigns samples to hypothetical clades based on STR values acts like a bookie who provides advance estimates for gambling odds, using a combination of calculations, educated guesses, and intuition.  A bookie’s estimates are usually tested by reality very quickly.  Probabilities of an STR estimator may not be verified or falsified by a new SNP for years.  You need to be skeptical of STR based predictions.  In the past, a number of STR based assignments have been shown wrong by new SNP discoveries.  This long web document is provided so you can read as much as you wish about my methods, judging for yourself the reliability of my estimates and net probabilities.

           The first confidence interval example above, confidence of STR predictions calibrated to SNP data, can be pure statistical calculation without any estimates.  However, judgment is involved.  Even such SNP predictions should be split into parts based on the step value of the samples within a type.  However, if split down to individual steps, the statistics are very poor due to small sample size, so steps are best combined in batches.  For the first data from a new SNP it is necessary to combine all the steps, so the predictions benefit from an estimated confidence by step.  So the judgments and calculations can get quite complicated, and often I just estimate the confidence from experience rather than do the calculations every day as data comes in.

           I avoid changing assignment rules often, so some assignment rules remain in place even after new data has provided better rules.

           My standard is 80% confidence, but I avoid introducing a new type until the confidence is a bit higher, because a new 80% confidence type would provide only a few samples at step zero on the day when enough data has accumulated.  After waiting for more data, I tend to bend the guidelines a bit below 80% confidence in order to introduce more samples with a new type.  Also, if I notice an individual coming out at 75% when I’m updating rules I’ll tweak the rule to include him.

           I tend to be generous in estimates for samples with all 67 markers, and I tend to be conservative with samples having fewer than 67.  I update the rules more often at 67.  After all, samples with fewer than 67 markers can get much better confidence by ordering more markers, and 67 is the most available as a standard commercial test.

           I do not look forward to a man feeling slighted when he is not assigned to a type that is a reasonable fit to his STR data.  On the other hand, I do not wish to be dismissed by others with experience evaluating STR data, so I try to be conservative in my probability estimates that particular clades in fact exist.  I will have achieved my goal if the number of people complaining that I assign too liberally turns out to be somewhat greater than the number of people complaining that I am too conservative (people who have read and understood my documentation).

           Naturally, my confidence changes from month to month as more M458 and STR data accumulates, for better statistics.

           Assignments at fewer than 67 markers:  There are two ways:  Some types have low SBP and seem 80% valid using 37 or only 25 markers, at least for samples at low step, so samples can be directly assigned.

           Second way:  I check for correlation using the samples with 67 markers to see what percent of samples at a given genetic distance using fewer markers end up in the corresponding type at 67 markers.  The confidence of a sample at fewer markers is that percent multiplied by the corresponding confidence at 67 markers.
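
           A sketch of the second way in Python.  The counts and the 90% figure are invented for the illustration, and the function name is mine.

```python
def assignment_confidence(n_at_distance, n_confirmed_at_67, confidence_at_67):
    """Fraction of calibration samples confirmed at 67 markers, multiplied by
    the confidence of the 67 marker assignment (a probability from 0 to 1)."""
    if n_at_distance == 0:
        raise ValueError("no calibration samples at this genetic distance")
    return (n_confirmed_at_67 / n_at_distance) * confidence_at_67


# Hypothetical: 20 samples at some 37 marker genetic distance,
# 18 of them end up in the type at 67 markers, where confidence is 90%.
print(f"{assignment_confidence(20, 18, 0.90):.0%}")  # 81%
```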

 

Validation Comments

 

           My P type and N type are equivalent to R1a1a7.  I introduced P, N, and K types in the Fall of 2007, publishing this web page 6 Dec of that year.  I did not predict that P and N were sister clades, in fact it looked to me like P was closer to K, but I had very low confidence in the relation between P, N, and K.  But I assigned samples to P and N with 80% confidence, remarking that my overall confidence that P and N were valid (confidence at step zero) was 95% in 2008.  I stated my overall confidence in the subtypes of K type as only 80%, but again my confidence at step zero was 95%.

           The new M458 SNP, the definition of R1a1a7, is consistent with (albeit not full proof of) my previous types.  If a significant number of samples from my P or N type were split between M458 positive and negative, that would have been a disproof.  I say “significant number” because STR types are necessarily statistical, so there should be some outliers that are improperly classified in any scheme based on STR markers.  In my case, samples should fit much better than my confidence percent numbers, because these are statistically worst case, as explained in the previous topic.

           I look forward to the discovery of SNPs validating more than 80%, probably about 90%, of my Polish Project assignments to types.

           A new SNP marker may not fall at the node defining a type.  For example, if an SNP for P type is much younger than my P type clade, some of the samples that I classify as P type may not be included in the new haplogroup defined by such a marker.  I would still consider my P type validated if most of the samples left out have relatively high genetic distance (obviously the older samples in P type).  In such a case, my type would still be valid, as a prediction of yet another, older, SNP marker that may be discovered later.

 

References and Sources

 

           My publications have several references of general interest and relevance to my web documents.

           My Tools and data for STR analysis are Excel files.  These are available at the JoGG publication site as Supplementary Data:  www.jogg.info/52/files/cpcindex.htm. 

           Polish Clades Update.  This folder is for update of Tools and for new data:  www.gwozdz.org/PolishCladesUpdate

           Pawlowski (2002) Arch Med Sadowej Kryminol 52(4):261 (in Polish).  This reference is listed in my publications.  I specifically mention it here because this is where I originally found the common Polish haplotype that I now call P type.  Link to English abstract:  Pawlowski 2002.

           Lawrence Mayka is the Administrator of the Polish Project.  Larry helped me to get started when I was new to genetic genealogy, providing helpful criticism & suggestions.  He reviewed & approved my 80% confidence method for assignments on the Polish Project web page.  He also reviewed the original drafts of my publications.  A number of my types were originally suggested to me as STR clusters by Larry.

           Cyndi Rutledge is the administrator of the R1a Project.  Larry and Cyndi send me M458 test results, which are not listed on the web.

           Anatole Klyosov published a pair of articles about STR clusters in the same Fall issue of JoGG that has my pair of publications.  Some of the STR types that I independently discovered I later found as 25 marker modal haplotypes in Klyosov’s web documents (before his publication in JoGG - some in Russian).  It was encouraging to see independent identification of clusters by different methods.  He emailed me an English version of one of his 2008 publications.  His Fall JoGG articles have references to his other publications.  Here is a web link:  Klyosov Home.

           Russian web sites:  http://www.rodstvo.ru;  http://dnatree.ru/;  http://molgen.org/.  These have been active analyzing R1a, brought to my attention by others, particularly by Mayka.  These sites clearly have proposed subdivisions of R1a based on STR data, but I cannot quickly understand these due to the language barrier;  I tried and failed to figure out a correlation to my types.  Klyosov seems to be active at these sites.  The sites make use of the FTDNA projects and Ysearch.  On 10 Jan Mayka emailed to me a link to some Russian maps of R1a, but I have not yet taken the time to try to figure these out.

           Kenneth Nordtvedt published an article about calculating TMRCA in the Fall 2008 issue of JoGG.  His Excel files of data and tools are available at his web site.  Ken has been active in web discussions, suggesting many STR based clusters.

           FTDNA link:  www.familytreedna.com.  This is a commercial DNA testing company.  I make extensive use of the project databases maintained by FTDNA.  These are my primary sources of data.  Click on the “Projects” tab at the home page to look for projects.  Also, the project name can be substituted for /polish/ in the following URL.

           Polish Project link:  www.familytreedna.com/public/polish.  One of many FTDNA projects.  This is my primary source for Polish data.  The Polish Project tracks both Y-DNA and mtDNA;  click on “Y-DNA Results” on the left to see the data that I use.

           R1a Project link:  www.familytreedna.com/public/R1aY-Haplogroup.  Another source.

           Ysearch link:  www.ysearch.org.  Ysearch is the largest web database for Y-DNA, run by FTDNA, open to all men, including men who also register with projects and including men with data from other testing services.  I use Ysearch often for analysis so of course I encourage you to register your Y-DNA data at Ysearch.  From the FTDNA site, you can register your data with Ysearch.  Or you can type your Y-STR data into Ysearch.  I am not associated with the company FTDNA.  I have Instructions for comparing your STR data to my types (modal haplotypes) that I have entered into Ysearch.

           Yhrd link:  www.yhrd.org.  A forensic Y-DNA database.  Data are separated by city, with many Polish cities.  I relied on Yhrd to figure out the geography of the various haplotypes.  I wrote Yhrd Reminders for myself so that I won’t forget how to navigate the Yhrd web site;  click on that link if you need some hints.

           Sorenson link:  http://www.smgf.org/.  Another DNA testing company.

 

Peter Gwozdz;  My Interest

Peter Gwozdz

pete2g2@comcast.net

           I’m a very rare type in Poland - E1b1b1a2.  My maternal 1st cousins are R1a1a.  That means my late maternal grandfather was R1a1a.  I became interested in Y-DNA in 2004.  My maternal family name is Iwanowicz.  I discovered a family with that name in my maternal grandfather’s home town in Poland.  They are the only Iwanowicz family within 50 miles, so I suspected they might be my 3rd or 4th cousins.  I brought a cheek swab kit when I visited them the second time in 2006.  Sure enough, the son is a perfect 25 STR marker match to my 1st cousin.

           I didn’t get around to checking the web for a year.  I was shocked to discover that these maternal cousins matched 80 people in the FTDNA database, for a perfect 12 out of 12 STR markers.  That was a hell of a lot of matches in the summer of 2007.  Most of these matches are Polish.  I did some research and found an article by Pawlowski (reference in my publication) about this most common Polish haplotype, which I now call P type.  That got me interested in doing more research, leading to this web page for others to see my results.

           My experience, however, is a reminder that statistics can be misleading.  I was confident that my grandfather’s haplotype was P type, based on a perfect match at the first 12 markers.  I now (Dec 2009) figure that the probability was really about 90%, because 9 out of the 10 current Polish Project members who have 67 markers and who also match P type perfectly at 12 markers are in fact P type as judged by all 67 markers.  My grandfather is that 10th one:  he does not match P type at 67 markers.  He matches the small hypothetical clade that I call I type, which is also concentrated in Poland.  But my confidence in that I type assignment is only 80%, so maybe statistics is fooling me again.  That’s how an outsider ended up studying P type and R1a1a, and writing web pages and articles about common Polish Y-DNA clades.

 

Explanation of Polish Project Assignments

 

           {5 Jan note.  This is an old topic, which is being rewritten and moved to other topics.  I’m saving this old version here.}

           If you got here by clicking on the link from the Polish Project, here is a brief explanation of the categories used for the R1a section of the Y-DNA Results page at the Polish Project at FTDNA:

           My Excel file has more detailed explanation, including assignments with less than 80% confidence.

                       Link:  www.gwozdz.org/PolishProjectR1a1Assignments.xls

           If you are assigned to the A, B, C, I, K, N, or P Type, I figure 80% or better confidence that your assignment will some day be validated by a new SNP defined haplogroup corresponding to that type.  If you are assigned to the R (Remainder) category, I figure 80% or better confidence that you will end up someday in another new haplogroup, not one of those types that I defined so far.  If you are assigned to one of the “Borderline” categories, I figure less than 80% probability you will end up in the corresponding future haplogroup, but you are close to that type, so if you do not end up in that one, you will likely end up in a closely related haplogroup that is not yet defined.  I use the U (Unassigned) category for uncertain samples where a better assignment is possible by further testing.

           I expect more than 90% of my assignments will be validated by SNPs some day, for two reasons.  First, many of the assignments are higher than 80% confidence.  Second, as explained in the Confidence topic below, each of my assignment rules is down-rated for 80% sampling statistics confidence, so the average probability is higher than 80% for a large number of 80% confidence assignments.
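
           The first reason is simple arithmetic, sketched here in Python with made-up confidence values (these are not the actual assignments).  If every assigned sample is at 80% or better and some are higher, the expected fraction validated is the average of the individual confidences, which must come out above 80%.  The sketch does not attempt to model the second reason.

```python
from statistics import mean

# Hypothetical per-sample confidences; every assigned sample is at 80% or better.
assignment_confidences = [0.80, 0.80, 0.85, 0.90, 0.95, 0.95]

# Expected fraction of assignments validated = average individual confidence.
expected_fraction_validated = mean(assignment_confidences)
print(f"{expected_fraction_validated:.3f}")
```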

           Naturally, my confidence changes from month to month as more M458 and STR data accumulates, for better statistics. 

           My net confidence ratings are based on a combination of statistical calculations and educated estimates, as explained in the Confidence topic below.

 

{20 Dec note:  the following is an old version of this topic that needs rewrite to take consideration of the recent M458 results:}

           Here is a longer explanation of my R1a1 assignments on the Polish Project Y-DNA Results page:

           “P Type” consists of the samples that match with at least 80% confidence to a hypothetical clade that is a subdivision of R1a1.  I call this clade “P type”, or just P for a short name.  P type is highly concentrated in Poland.  A “sample” here means the Y-DNA STR data for a man who joined the Polish Project.  Some of the P samples match with better than 80% confidence, so I predict that more than 80% of the P samples will be verified some day by an official “haplogroup” corresponding to what I call P type.  Haplogroups are also called “groups”.  Official haplogroups are divided into smaller official haplogroups.  “Clade” here refers to all haplogroups, including those not officially discovered.  I use the word “type” for my subdivisions (hypothetical clades, not official), in order to distinguish from official words such as “haplogroup”, “group”, and “haplotype”.  Below on this web page I explain what I mean by a “type”, and how I assess the probability that a type corresponds to a clade.  A “cluster” is a group of samples with similar STR data.  All types have corresponding clusters, but not all clusters are types.  On the Polish Project page, I only provide types for which I have 80% or better confidence in validity.  I consider types at lower confidence below in this page, and in an Excel File with Assignments page.

           Of course, if your sample is listed in the P Type category there is a chance your sample will not actually end up in that future haplogroup, because such predictions based on STR data are not certain.  I judge that the probability of missing is less than 20% for each sample that I categorized into the P Type.

           Similarly, “A Type”, “I Type”, “K Type”, and “N Type” are samples assigned to corresponding types with 80% or better confidence.

           Please don’t get confused.  These same capital letters are also used for the large official haplogroups.  The I, N, and R haplogroups are particularly large in the Polish Project.  I am dealing here only with haplogroup R1a1, which is a subdivision of haplogroup R.  I am using capital letters to subdivide R1a1.  I expect my subdivisions will be verified someday, with names something like R1a1h1b2, etc.

           A type and I type seem to be subtypes of K type.  So the samples assigned to K are the ones that have 80% confidence of belonging to K but do not match A or I with 80% confidence.  In other words, my K Type is really my K*, where the * notation is conventionally used to indicate samples that belong to a clade, but only those samples that do not belong to a known subclade.
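
           The K* idea amounts to a set difference, sketched here in Python.  The function name and the sample labels are invented for the illustration.

```python
def k_star(k_members, subtype_members):
    """Return the samples that belong to K but to none of its listed subtypes."""
    in_some_subtype = set().union(*subtype_members.values())
    return set(k_members) - in_some_subtype


# Hypothetical sample labels: five samples in K, three of them in a subtype.
k_members = {"s1", "s2", "s3", "s4", "s5"}
subtypes = {"A": {"s1"}, "I": {"s2", "s3"}}
print(sorted(k_star(k_members, subtypes)))  # ['s4', 's5']
```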

           “R1a - Remainder” consists of the samples that do not match one of my types, with 80% confidence of not matching any.  I call these R for short.  R is not a “type” per my definition.  The concept of R is very similar to the concept of R1a1* for haplogroups.  We hope that someday haplogroups will be defined for what I call R samples, but right now the STR data does not cluster (correlate) well enough for 80% confident identification of more types.

           Many samples do not match any of my 5 types with 80% confidence but match one of them with 20% or more confidence.  I do not put these into R because they are not R with 80% confidence.  These samples are classified “R1a - Unassigned”, or U for short, if they have fewer than 67 markers measured.  U is arranged as the last section in the R1a1 section of the Y-DNA chart at the Polish Project web page.  U is the largest classification (most samples) on that page.  U is for the samples that cannot be assigned with 80% confidence.  U is only used for samples with fewer than 67 STR markers measured.

           For samples with all 67 standard STR markers, I do not use U.  Instead, I subdivided the U samples by which type they best match.  So there are classifications such as “P Borderline” and “N Borderline”, only for samples with all 67 markers.  Borderline categories do not correspond to types, because these categories are a mix of samples that probably do not belong to the same type.  Samples in R have 80% confidence, which means each of those matches none of my defined types at 20% or better.  For example, the P Borderline samples match P type poorly, but well enough for me to predict that each of these has a 21% to 79% chance of ending up some day in a haplogroup corresponding to P type.  Summary for this P type 67 STR marker example:  where I figure 80% or more a sample goes into P;  where I figure 20% or less the sample goes into R unless it matches another type;  where I figure less than 80% but more than 20% the sample goes into P Borderline unless it matches another type better.
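
           The category rules in the last few paragraphs can be sketched as a small Python function.  This is a simplification for illustration only:  the function name and inputs are invented, the confidence numbers are made up, R is approximated as “best match at 20% or less”, and the K* handling of subtypes is left out.

```python
def categorize(type_confidence, has_all_67_markers):
    """Pick a Polish Project R1a category from per-type match confidences.

    type_confidence: dict mapping a type letter (e.g. "P") to the estimated
        confidence (0.0 to 1.0) that the sample belongs to that type.
    has_all_67_markers: True if the sample has all 67 standard STR markers.
    """
    best_type = max(type_confidence, key=type_confidence.get)
    best = type_confidence[best_type]
    if best >= 0.80:
        return f"{best_type} Type"
    if best <= 0.20:
        return "R1a - Remainder"
    # Between 20% and 80%: Borderline at 67 markers, Unassigned otherwise.
    if has_all_67_markers:
        return f"{best_type} Borderline"
    return "R1a - Unassigned"


# Hypothetical confidences for three samples.
print(categorize({"P": 0.85, "N": 0.05}, has_all_67_markers=True))
print(categorize({"P": 0.50, "N": 0.10}, has_all_67_markers=True))
print(categorize({"P": 0.50, "N": 0.10}, has_all_67_markers=False))
```

           The three calls show a direct assignment, a Borderline sample with all 67 markers, and the same confidences falling into Unassigned when fewer markers are available.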

           This long web document is arranged with simple, important comments first.  Details and complications are discussed later in this document.

 

Revision History

 

2007 Dec 6.  First web posting.  Two files.  This “PolishClades.html” and “PolishProjectR1aAssignments.xls”.  First drafts.

2007 Dec 22 to 2008 Oct 27.  Eleven revisions

2009 Jun 4 minor wording changes, plus mention that 2 articles are scheduled for submission for publication this month.  Also a comment about assignment rules update scheduled.

2009 Aug 14 notice that new assignment rules are available;  this page is being rewritten.

2009 Aug 31 slip rewrite date

2009 Sep 16 start complete rewrite;  not finished

2009 Sep 18 rewrite complete

2009 Oct 3 mention of JoGG submission;  extensive minor wording changes

2009 Oct 22 several paragraphs added above Abstract to explain the new Polish Project Categories

2009 Oct 25 finished editing everything, consistent with new assignment rules - modified the Assignment Table

2009 Oct 26 another edit, with some clarification rewording, mostly in the “Confidence Comments”

2009 Oct 30 edit to mention that the Excel file with Assignments has been updated, using the new rules

2009 Nov 4 add a section at the top.  New M458 split of R1a

2009 Nov 6 change title from R1a1 to R1a.  Also a rewrite of the new section at the top

2009 Nov 11 expand the R1a signature prediction beyond Polish samples

2009 Nov 12 expand the R1a signature prediction to more markers and more signatures

2009 Nov 14 move the R1a Underhill analysis (beyond Poland) discussion to a new web page, R1a.html

2009 Nov 18 move the M458 test results to the R1a page

2009 Nov 21 my method & results published in JoGG, Fall issue, today

2009 Dec 11 minor link correction

2009 Dec 17 comment about rewrite at R1a.html

2009 Dec 20 M458 test results update;  “Assignment News” new first topic

2009 Dec 24 rewrite Abstract, new Introduction and Mountain Method topics

2010 Jan 4 redo of links for References and Sources;  also quite a bit of general rewrite

2010 Jan 5 more general rewrite

2010 Jan 6 update Results Table

2010 Jan 9 D type and more rewrite

2010 Jan 10 update the type discussions

2010 Jan 16 minor changes