Results

Common Polish Y-DNA Clades

7 Aug 2018

Peter Gwozdz

pete2g2@comcast.net

 

Notice

            My xls analysis files are no longer posted on the web, so the links to these files do not work.  I’ll be rewriting this page to remove specific links to my STR analysis data;  it may take me a few months to finish a full rewrite.

 

Polish Y-DNA Haplogroup Summary Table

            18 Oct 2016 Add L1237.  19 Oct 2016 rename I1-Vistula.

            Full update 18 Sep 2016.

            This table displays the haplogroups that are most common in Poland.  The table is arranged as a tree lying on its side, with branches going to the right.  The height of each branch is proportional to the percent frequency.  Haplogroups smaller than 1% are not shown, so only the largest branches, in the first column, add up to 100%.

            More detailed discussion is below the table.  There are statistical issues and bias issues regarding tables such as these, so the percent frequencies should be viewed as approximate;  percent is given to one decimal place in order to clearly communicate relative frequencies in the database analysis.

            For full details of the data and analysis, see the file ResultsTable.xls.

            For a more basic discussion about what all this means, please see the Abstract in the main web page.

 

Haplogroup  %

C_to_Q  37.7%
    CEFGH  8.1%
        C  0.3%
        E (M96)  5.1%
            E1b1b1 (M35)  5.1%
                E-V13  2.6%
                E-Z827  2.0%
        FGH  2.7%
            G (M89)  2.6%
                G2a (P15)  1.6%
    I J  22.9%
        I1a (DF29)  6.3%
            I1a-Y6340 I1-Vistula  2.1%
            I1a-L22  1.2%
            I1a-Z63  1.9%
                I1a-L1237  1.3%
        I2 (P215)  9.2%
            I2-P37  7.1%
                I2-S17250  3.8%
                    I2-Y4882  1.8%
                I2-Y4460  1.0%
            I2a2 (M223)  1.8%
                I2-L801  1.4%
        J (M304)  7.4%
            J1 (M267)  2.6%
                J1-YSC234  1.1%
                J1-Z18297  1.3%
                    J1-L816  1.1%
            J2 (M172)  4.8%
                J2a (M410)  3.2%
                    J2-M67  1.8%
                    J2-Z2222  1.2%
                J2-M575  1.6%
                    J2-M241  1.5%
    LTNQ  6.7%
        N (M231)  5.2%
            N-L550  4.8%
                N-M2783  4.1%
                    N-Y5580  1.1%
                    N-Z16975  1.5%
                    N-Z16981  1.1%
        Q1 (L232)  1.1%
R1a1 (M459)  49.8%
    R1a-M458  22.7%
        R1a_L260 P type  12.9%
            R1a-YP1337  2.0%
            R1a-YP254  9.1%
                R1a_Y2905  4.3%
                    R1a-YP3927  1.3%
                R1a-Y4135  1.2%
                R1a_YP414  3.6%
            R1a-YP654  1.8%
        R1a-CTS11962 N type  9.4%
            R1a-L1029  7.1%
                R1a-YP263  1.3%
                R1a-YP444  1.1%
                R1a-YP593  2.4%
            R1a_YP515 Np cluster  2.2%
    R1a-Z280  23.7%
        R1a-Z92  4.8%
            R1a-YP351  3.4%
        R1a-CTS1211  18.3%
            R1a-YP343  1.9%
                R1a-P278  1.0%
            R1a-CTS3402  15.9%
                Y2613 D Type  2.2%
                    R1a-Y2608  2.0%
                R1a-YP237  7.0%
                    R1a-YP269 Gb type  1.2%
                    R1a_YP389_G_type  2.0%
                    R1a-YP977 J type  1.8%
                R1a-Y33  6.9%
                    R1a-L1280  1.0%
                    R1a_S18681_I_type  2.4%
                    R1a-Y2902 B type  3.2%
    R1a-Z93  2.9%
        R1a-Y2619 A type  1.7%
R1b1a (L754)  12.4%
    R1b-M269  12.3%
        R1b-U106  4.5%
            R1b-L48  3.5%
                R1b-L47  1.9%
                    R1b-L47 P type  1.5%
                R1b-Z9  1.6%
        R1b-P312  4.7%
            R1b-DF27  1.3%
                R1b-Z196  1.1%
            R1b-U152  2.1%
                R1b-L2  1.7%
        R1b-Z2103  2.1%
            R1b_Y5587_L23EEType  1.9%
R2  0.2%

Total  100%

Discussion of the Haplogroup Summary Table

            Rewrite 20 Feb 2016.  Edit 9 Jul 2016.  Edit 18 Sep 2016.

            The table shows the percentage of samples (men) in each haplogroup.  The total database is the set of samples in the Polish Project that indicate “Poland” as the earliest known male line ancestor’s origin.

            There are issues involved with “Poland” because the borders of Poland have changed often, and Poland as a country did not exist for more than a century;  I discuss some of these issues in the two “Poland” topics below.

            There are statistical issues involved with percentages calculated from such a database;  I discuss some of these issues in the Statistical topic below.

            This new version also has many new haplogroups, corresponding to new SNPs.  There has been a flood of new SNPs over the past couple years, so this new version has more detailed categories.  Watch this page for future updates.

            For more details, my analysis file ResultsTable.xls is also available on-line.

            I tried a few versions of different browsers to view the Table above.  Almost all browsers display the table as I intend, but there are exceptions with minor problems.  Sorry if your browser has display problems, for example by including lines within the cells, where I intend each cell to represent a haplogroup, solid color, bordered by lines in a rectangle.  The cells to the right are subdivision haplogroups.  Like a tree, but with branches going right.  Some common Y-DNA haplogroups are not included because they are not common in Poland.

            Very small haplogroups, and residual paragroups, are white, not displayed as cells, no percent displayed.
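
            The size of a white residual follows from the displayed percents:  it is the parent branch minus the sum of its displayed sub-branches.  The short Python sketch below illustrates that arithmetic with three branches from the table.  The function name residual_percent is hypothetical, and reading C, E and FGH (and likewise I1a, I2, J and N, Q1) as the sub-branches of those parents is an assumption based on the branch names and on how the percents add up.

```python
def residual_percent(parent, children):
    """Percent of a parent branch not covered by its displayed sub-branches."""
    value = round(parent - sum(children), 1)
    return value + 0.0  # turns a possible -0.0 into 0.0


# Percents copied from the summary table; the grouping is an assumption.
branches = {
    "CEFGH": (8.1, [0.3, 5.1, 2.7]),   # C, E (M96), FGH
    "I J": (22.9, [6.3, 9.2, 7.4]),    # I1a (DF29), I2 (P215), J (M304)
    "LTNQ": (6.7, [5.2, 1.1]),         # N (M231), Q1 (L232)
}

for name, (parent, children) in branches.items():
    print(name, "residual:", residual_percent(parent, children), "%")
```

            Because the table values are rounded to one decimal place, a residual computed this way is only good to about 0.1%.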

            Several haplogroups were discussed in my web pages as hypothetical clades, called types before the corresponding SNPs were discovered.

            The height of each haplogroup cell is proportional to the percent.  The full table is 10 inches tall, so 1% is a tenth of an inch, and 1.0% is my minimum cell for display.  The cells are much smaller when viewed on a cell phone.  However, you can view my complete listing of all categories in my Excel analysis file, which is discussed in more detail in the following topic.

            The font of the table can be small on cell phones or tablets, but you can use your fingers to expand to read, and then move around the table.  On desktops, change the Zoom setting as needed to view the small font cells.

            The left side of the table is based on all samples, including those with only 12 STR markers, so the statistics are better on the left.  The far right of the table is based proportionally on only those few samples for which the appropriate SNPs have been tested, because those haplogroups cannot be predicted by STRs, so the right side is based on fewer samples, which means lower accuracy.  Please consider the far right column as estimates based on few samples;  the accuracy will improve in the near future as more samples are tested for these new SNPs.  The central region of the table is intermediate, because samples with 67 STR markers can be confidently predicted to the larger branches even without SNP data.  For more details, please see the following topics.

            The ISOGG (http://isogg.org/tree/) codes for the largest branches in the table are based on the ISOGG tree of July - Sept 2016;  the ISOGG codes change as new SNPs are discovered.  Most of the table does not have full ISOGG codes because they are not practical for small branches.  The YFull tree (http://www.yfull.com/tree/) was used for haplogroups not listed by ISOGG.

 

Previous Versions

            New Topic 18 Sep 2016.

            The Nov 2013 version of this table is still available at Results2013.html.  That old version is based on the entire Polish Project, taken as representative of historical Poland.  You can compare the different percent results for that old version to this current version, which is restricted to “Poland”.  The Mar 2015 version, also restricted to “Poland” is also available for comparison of how the % values change as data accumulates;  see Results2015.html.

            The Jan 2016 version is also available,  ResultsJan2016.html.  That was the first update using Big Y results to report many small twigs of the Y-DNA tree, based on new SNP discoveries.  I did not make adjustments for STR assignment bias in the Jan 2016 version, discussed below;  I did make adjustments for STR assignment bias in this most recent July 2016 version.  You can compare these two 2016 versions to see the effect of this bias.

            I call this the July 2016 version because 6 July is the date I downloaded the Polish Project data for analysis.  It takes me 1 or 2 months of spare time work to do the analysis and update this Results document, so these results were not posted on-line until 18 Sep 2016.  The Jan 2016 version was downloaded 18 Jan and posted 24 Feb 2016.

 

Details of the Analysis File

            Rewrite 18 Sep 2016.

            Please refer to my analysis file ResultsTable.xls.  That file provides the data for the Haplogroup Summary Table at the top of this web page.  That file has 20 sheets.  It opens to the final sheet, on the far right, which is data for this web page, arranged in 6 pairs of columns.  Column Q is the most detailed result, corresponding to the far right column in the table at the top of this web page.  The individual cells in column Q show the % value, but that cell actually holds a pointer to one of the other sheets where the corresponding value is determined.  The first sheet, on the far left, “Notes” has the explanations of how the analysis was performed.

            Public Data:  The Polish Project had 3,628 samples listed in my download on 6 Jul 2016.  I edited for family sets, which left 3,510 samples.  Of these, 1,681 samples indicate “Poland” either in the “Country” column or as an addition to the “Paternal Ancestor” column.  These 1,681 samples are the input data to that ResultsTable.xls file, listed in the 2nd sheet, Category Counts, where they are totaled by assignment category.
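
            The “Poland” selection described above can be sketched in a few lines of Python.  This is only an illustration, not the procedure used for ResultsTable.xls:  the records are made up, the dictionary layout is hypothetical, and the case-insensitive search for the word “Poland” in the two columns is an assumption about how such a filter might be written.

```python
def indicates_poland(sample):
    """True if either origin field of a sample mentions Poland."""
    country = sample.get("Country", "")
    ancestor = sample.get("Paternal Ancestor", "")
    return "poland" in country.lower() or "poland" in ancestor.lower()


# Made-up records for illustration only; not Polish Project data.
samples = [
    {"Country": "Poland", "Paternal Ancestor": "Jan Example, b. 1820"},
    {"Country": "Germany", "Paternal Ancestor": "Adam Example, b. 1850, Poland"},
    {"Country": "Ukraine", "Paternal Ancestor": "Piotr Example, b. 1790"},
    {"Country": "", "Paternal Ancestor": ""},
]

poland_samples = [s for s in samples if indicates_poland(s)]
print(len(poland_samples), "of", len(samples), "samples indicate Poland")
```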

 

Polish Project vs “Poland”

            Edit 27 Feb 2016.  Edit 18 Sep 2016.

            In my Fall 2009 publication I discuss reasons for considering the Polish Project as representative of historical Poland.  Accordingly, my tables of haplogroups have in the past been based on all the Polish Project samples with Y-DNA STR data.  Since 2015, I restrict the data to samples that indicate “Poland” as the place of origin for their earliest known male line ancestor.  There are issues with the use of “Poland”.  These issues are discussed in the next topic.  This topic provides my reasons for the change.

            The basic reason:  the Polish Project now has too many samples from the countries bordering Poland.  Many if not most of these are probably not from the region of historical Poland.

            There are good reasons for allowing men to join the Polish Project even with ancestry from regions somewhat beyond the largest borders of historical Poland.  The Y-haplogroups for these regions are present in Poland, and more data is helpful in the task of determining the branching of haplogroups.  Discovery of the branching SNPs is one goal of the Polish Project.

            Another reason:  Mayka has been doing a great job of analysis of Polish Project data, and keeping the web page up to date.  This has attracted men from border countries to join the project, in order to learn their haplogroup branch, and in order to see the names of their distant male line relatives.  This fills another goal of the Polish Project.  Some of the neighboring countries do not have such well managed projects.

            A big reason is effort:  it takes a lot of effort for Mayka to determine which samples are not from the region of historical Poland.  In most cases in the past, the men needed to be contacted by email.  Mayka had been analyzing such samples, and flagging them as “No Public Display” so they were not displayed in the public web pages.  With the growth of the Polish Project, this effort has become too great.  Some samples obviously not connected to the region of Europe near Poland are still flagged with “No Public Display”.

            I mention above that more data is helpful.  Restriction to “Poland” means less data, which means less statistical accuracy.  This was another of my reasons for considering the full Polish Project in the past.  The Polish Project has grown a lot over the years, however, so the restriction to “Poland” is no longer a very serious issue.

            There are other Y-DNA projects that could be combined with the Polish Project data to produce a larger “Poland” database.  However, these other projects concentrate on specific haplogroups or on other countries;  I know of no way to combine projects and adjust for bias to particular haplogroups.

            Men with “Poland” provided for origin who do not join the Polish Project may be relatively less certain of their Polish male line ancestry, so the Polish Project may well serve as a more concentrated database regarding Polish ethnicity.  Ethnicity is discussed in the following topics.

            There are issues involved with using self assignment to “Poland” as the criterion for a database.  These issues are discussed in the following topic.  However, there are also issues in using self assignment to a geographical region as the criterion, so my recent change to “Poland” is primarily a change of issues and interpretation, as discussed in the following topics.

 

Issues with “Poland”

            Edit 21 Feb 2016.

            For the Haplogroup Summary Table, my database is those samples from the Polish Project that indicate “Poland” as the earliest known male line ancestor’s origin.  There are issues with this use of “Poland”:

            Partition of Poland.    “Austria” or “Russia” or “Prussia” or “Germany” were used by many Polish immigrants in other countries (mainly USA around 1900) in documentation of place of birth, because Poland did not exist from 1795 to 1918.  Poland was split three ways.  The Austrian sector was in the south.  The Russian sector was in the east.  The Prussian (now German) sector was in the west.  Some men, when submitting their DNA data, correct this and enter “Poland” if they consider their male line Polish, while others who consider their male line Polish just copy the documentation for origin as reported in the immigration document.  The uncorrected data reduces the “Poland” database and thereby reduces the accuracy of haplogroup percentages.  Data may have a bias if correction from certain sectors is more common.  For example, maybe Polish immigrants from Russia said “Poland” as their origin less often than Polish immigrants from Austria said “Poland”;  or vice-versa;  we don’t know which way it might bias.  Insofar as some haplogroups are more common in some sectors than others, this may introduce a bias in the percentages of haplogroups in Poland.

            In the time between WWI and WWII, a large part of what is now Belarus and Ukraine was in the eastern part of Poland.  Again, some men who consider their male line Polish may correct their documentation to “Poland”, if the documentation says “Belarus” or “Ukraine”.  Here, there is also the opposite problem:  some men from Belarus or Ukraine who do not consider their male line Polish may enter “Poland” if their documentation so indicates.  Similarly, some of what is now western Poland was Germany in the past.

            Commonwealth.  During the centuries before the partition of Poland, Poland was a “Commonwealth” with a much larger area.  The Commonwealth of Poland was in existence well before the era of nationalism;  the Commonwealth was more like the United States today;  people who lived in the Commonwealth were considered Polish because they lived in the Commonwealth.  These people were a mix of many different nationalities (ethnicities).  Up until WWII, in the region considered “Poland”, there were many neighborhoods of people who spoke a language other than Polish, with a different culture;  these people may or may not have considered themselves Polish.  Similarly, there were many neighborhoods of people who spoke Polish in countries outside the then current Polish borders.  Today, because of WWII and the aftermath, Poland is not very ethnically diverse by comparison.  Haplogroup J, in particular with regard to Ashkenazi Jews, is an example of a haplogroup over represented in the table above in comparison to modern Poland, because of diaspora from outside Poland who enter “Poland” for male line ancestry.  I did not edit in this regard;  I suppose many Ashkenazim in previous centuries considered themselves Polish, so in a sense the table above may be more representative than Poland itself - of who considered themselves Polish over the centuries.

            Geographical vs Ethnic DNA data.  In this sense, the Polish Project is more representative of historical Poland than current Poland.  The Polish Project is not solely representative of ethnic Polish people.  Restriction to “Poland” samples provides a more ethnic database.  More discussion about this is below.

            Polish family names can be seen in DNA data with male line origin given as USA, or England, or other countries not close to Poland.  Obviously, these are mostly men who have not found documentation for their very distant ancestors, so they state the birthplace of their known ancestor, perhaps a father or grandfather.  I considered increasing the database by including data with Polish family names, but that is difficult to do, and that produces a problem that I consider overwhelming:  many Polish people have German or Russian sounding names even though their ancestors were all Polish for as many generations back as they can trace.  Some of these may be male lines that moved to Poland long ago from elsewhere, but many if not most of these are probably male lines where the family names were Germanized or Russianized, in the same way that many Polish names today are Anglicized in the USA.  I saw many instances of such family name changes in 19th century Poland when studying the Polish microfilms of vital records while researching my family tree.  So I do not add Y-DNA data from other countries given as origin based on Polish names, and I do not remove “Poland” data for names that do not sound Polish.  Such a name based study may be an interesting experiment.

            A very small percentage of Polish Project samples have non-Polish names and come from a man who knows that his father or grandfather was adopted from an orphanage, where there is suspicion that his male line is Polish, or from other similar special situations.  Although rare, these cases indicate how difficult it would be to restrict Polish Project membership without the danger of excluding someone with a genuine interest in the Polish Project.

            Geographical issues such as the previous paragraphs are not unique to Poland.  The same types of issues come up in an analysis of Y-DNA from most countries.  But these issues are particularly applicable to Poland, because of the wide swings in the border of the country of Poland over the centuries, and because of the large Polonia diaspora, and because Poland was very cosmopolitan in the past.

            Biskupski published a delightful essay regarding the ambiguity of the word “Polish”.  (Biskupski M, “Who is a Pole and Where is Poland?”  Rodziny - The Journal of the Polish Genealogical Society of America (PGSA) Summer 2006:5-12 - can be purchased through PGSA.)  Biskupski considers all these issues, and also other ambiguities that I do not cover in this brief discussion.  Biskupski does not provide a definition of “Polish”;  in fact he avoids such a definition, preferring to let people define themselves as Polish because “it feels right”, quoting Tuwim:  Jestem Polakiem bo tak mi się podoba.  I am a Pole because that feels right to me.  Note that I follow Biskupski’s advice in the paragraphs above with the phrase “consider their male line Polish”.  Biskupski points out the problem with restricting “Polish” to the current borders of Poland:  with such a restriction famous Poles such as Kosciusko, Sobieski, Paderewski, and others would not be considered Poles.  Biskupski also points out that if we expand the borders to historical Poland, then Tchaikovsky, Dostoevsky, Nietzsche, and others become Poles.  Clearly, borders cannot be used alone as a supposed definition of who is a Pole.

            Self assignment as male ancestral origin “Poland” can of course be a problem for any statistical study.  No doubt some people are wrong about their male line ancestry, or self deluding.  It may be that men uncertain of their origin may be more likely to consider Y-DNA tests and more likely to join the various projects, so maybe DNA projects have a higher concentration of uncertainty than the general population.  On the other hand, men who state “Poland” as origin are surely more likely to be of Polish ethnicity than men in the Polish Project who state another country.  Conversely, men who do not consider themselves Polish may join the Polish Project because their male ancestor came from the region of historical Poland, and those men are more likely to not state “Poland” for the origin.  In this regard, my tables since 2015 are more ethnically Polish than my previous tables.  I have evidence for this, in the following topic.

            Ethnicity.  Some sociologists deny the genetic validity of the very concept of ethnicity because of all the migrations and mixing over the millennia.  I consider my analysis of Polish Project data to be my contribution to the measure of the degree of genetic validity of “Poland” as an ethnic term.

            Accuracy and uncertainty.  It seems to me all these issues about the meaning of a self assigned “Poland” ancestral origin may provide the largest uncertainty in the percentage frequency results reported here.  Certainly in the largest branches (to the left in the Haplogroup Summary Table), where the number of samples is large, and where statistical accuracy is best, the ambiguity of “Poland origin” is the largest uncertainty of the results.  I have no method to quantify this large uncertainty.  On the far right, with branches as small as 1%, statistical uncertainty surely dominates, which is quantified below.

            Justification.  Despite all the caveats discussed in this topic, the Polish Project Y-DNA “Poland origin” data is the best database I know of for the analysis of Y-DNA haplogroup frequencies for Poland.

 

Data Analysis

            Edit 22 Feb 2016.  Edit 18 Sep 2016.

            I use L260 as an example;  the same kind of analysis can be done for many of the haplogroups in the table at the top of this page, regarding evidence of ethnic concentration.

            L260 is the largest Y-DNA haplogroup with significant concentration in Poland and with much lower concentration elsewhere.  Before L260 was discovered, I identified this clade (hypothetical haplogroup) with the name P type.  In my Fall 2009 publication I reported the frequency of P type Y-DNA as “about 8%” in Poland.  In Fall 2010 (same journal, JOGG) I reported the discovery of L260 and showed that it seems equivalent to P type, with very few P type outliers (samples that marginally match P type with 67 STRs but test negative for L260) and even fewer L260+ samples that do not match P type.

            In the table above, L260 P type is much higher, at 12.9%.  This is primarily because the data is restricted to “Poland” origin samples.  That previous 8% was using the full Polish Project, representative of historical Poland.  In other words, L260 is significantly concentrated in ethnic Poles, even within the geographical region of historical Poland.

            Other haplogroups, less frequent among ethnic Poles, are coming out lower in the new table.  These can be seen by comparing the current table at the top of this web page to my 2013 table at Results2013.html.  Examples:  Haplogroup E was 6.0% in the 2013 Polish Project, but is 5.1% now when restricted to “Poland”.  Haplogroup N was 8.8% in the 2013 Polish Project, but is only 5.2% now when restricted to “Poland”.  Both E and N are widely distributed European haplogroups, which apparently do not happen to have any large sub group concentrated in Poland.  In the near future we may have enough data to identify very small E and N branches that are concentrated in Poland, although it will be very difficult to associate these with specific migrations long ago.
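
            The comparisons in this topic can be put side by side as simple ratios of the “Poland” percent to the earlier full Polish Project percent.  The Python sketch below only illustrates that arithmetic;  the ratio is not the Poland Concentration Index mentioned elsewhere on this page, the function name is hypothetical, and the L260 entry uses the “about 8%” figure from the Fall 2009 publication, whereas E and N use the 2013 table.

```python
def poland_to_full_ratio(full_percent, poland_percent):
    """Ratio above 1 means the haplogroup is more frequent in the "Poland" subset."""
    return poland_percent / full_percent


# (full Polish Project percent, "Poland"-restricted percent), from the text above.
comparisons = {
    "L260 P type": (8.0, 12.9),
    "E": (6.0, 5.1),
    "N": (8.8, 5.2),
}

for name, (full, poland) in comparisons.items():
    print(f"{name}: ratio {poland_to_full_ratio(full, poland):.2f}")
```

            The ratios compare versions made at different dates with different databases, so they mix the effect of the “Poland” restriction with ordinary changes as data accumulated.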

 

Ethnicity Comments

            Edit 28 Mar 2015.

            I am aware that Y-DNA data can be misinterpreted with regard to ethnicity.  There has been a lot of migration of ethnic peoples in Europe over the centuries.  Ethnic peoples have mixed.  DNA has been scrambled due to recombination, but Y-DNA is unusual because the Y chromosome does not recombine (except for a very small part).  The Y chromosome alone does not indicate ethnicity for individuals.  Each of us has more than 1,000 ancestors about 300 years ago.  At 10 generations, 2 to the 10th is 1,024, so each of us has 1,024 branches in our pedigree tree for ancestors going back 10 generations.  30 years is a bit high per generation on the average, so the number of ancestors is greater than 1,024 at 300 years.  Our Y chromosome is associated with our male line, which is only one of those 1,000+ ancestral lines 300 years ago.  Going back more than 300 years is much more dramatic.
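
            The pedigree arithmetic above is easy to check.  In the Python sketch below, the 25 year generation length is only an assumed example of a value shorter than 30 years, and the function names are hypothetical.  The count is of branches in the pedigree tree, as in the paragraph above.

```python
def generations_in(years, years_per_generation):
    """Whole generations that fit into a span of years."""
    return years // years_per_generation


def pedigree_branches(generations):
    """Number of ancestral lines in the pedigree tree at a given generation."""
    return 2 ** generations


for years_per_generation in (30, 25):  # 25 is an assumed, illustrative value
    g = generations_in(300, years_per_generation)
    print(f"{years_per_generation} years per generation: "
          f"{g} generations, {pedigree_branches(g)} branches, 1 male line")
```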

            That said, the Y is interesting because certain Y haplogroups are concentrated in certain geographical regions, which is evidence of concentration in the past, and which is a hint regarding origins and migrations.  P type (L260) is one of the more dramatic, with high concentration only in Poland, among ethnic Poles.  Obviously, the mixing of people was not enough to homogenize the Y chromosome across the continent, or even across the region of the Polish Commonwealth.

 

Statistical Issues

            Rewrite 23 Feb 2016.  Edit 9 Jul 2016.  Edit 18 Sep 2016.

            Small number of samples:  When % frequency of a small haplogroup is calculated, based on a small number of samples, there is a well-known uncertainty in the result, expressed as a confidence range.  Poisson statistics is used for calculation.  I explain this in my Fall 2009 publication about types, and I provide a macro for calculation of confidence range in my Excel files for types;  the master file is Type.xls, with a demonstration of the Poisson statistics macro on the far right of sheet “SBP”.  As a small sample example, the very short cell at the top of this web page, haplogroup C (top haplogroup, not colored), has only 5 samples, with frequency 5/1681 = 0.297%.  The 70% confidence range is 0.17% to 0.51%;  the 90% confidence range is 0.12% to 0.63%.  In other words, future data for haplogroup C will likely come out somewhat lower or higher than 0.297%;  we are not confident of the exact frequency in the large worldwide population of men with Polish male ancestry who might join the Polish Project.

            Technical caveat:  my macro only considers the number of samples in the haplogroup or type, not considering the uncertainty due to the number of samples in the database, because the latter is very large in this case.  My macro understates the confidence range for a frequency where both the numerator and denominator are small.

            In general, the left side of the table is based on hundreds of samples and the center of the table is based on dozens of samples.  Uncertainties due to number of samples are very small compared to the other statistical issues, below, and compared to the non-random biases discussed above in the “Poland” topics.

            On the right side the confidence range is wider.  For example, 1.0% corresponds to 17 samples, for which the 70% confidence range is 0.76% to 1.33%.
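
            The macro in Type.xls is not reproduced here.  As a rough illustration, the Python sketch below computes an exact two-sided Poisson range for an observed count and converts it to percent of the 1,681 sample database.  That the macro uses this same method is an assumption;  the function names are hypothetical.  Like the technical caveat above, the sketch treats the database size as exact and considers only the uncertainty in the count.

```python
import math


def poisson_cdf(k, lam):
    """Return P(X <= k) for a Poisson variable with mean lam."""
    if k < 0:
        return 0.0
    term = math.exp(-lam)  # P(X = 0)
    total = term
    for i in range(1, k + 1):
        term *= lam / i  # P(X = i) from P(X = i - 1)
        total += term
    return total


def solve_decreasing(func, target, low, high, steps=200):
    """Bisection for func(x) == target, where func decreases as x grows."""
    for _ in range(steps):
        mid = (low + high) / 2.0
        if func(mid) > target:
            low = mid
        else:
            high = mid
    return (low + high) / 2.0


def poisson_count_range(count, confidence):
    """Exact two-sided range for the Poisson mean, given an observed count."""
    tail = (1.0 - confidence) / 2.0
    high = count + 10.0 * math.sqrt(count + 1.0) + 10.0  # generous search bracket
    if count == 0:
        lower = 0.0
    else:
        lower = solve_decreasing(
            lambda lam: poisson_cdf(count - 1, lam), 1.0 - tail, 0.0, high)
    upper = solve_decreasing(lambda lam: poisson_cdf(count, lam), tail, 0.0, high)
    return lower, upper


def frequency_range_percent(count, database_size, confidence):
    """Confidence range of a haplogroup frequency, in percent of the database."""
    lower, upper = poisson_count_range(count, confidence)
    return 100.0 * lower / database_size, 100.0 * upper / database_size


for count, confidence in [(5, 0.70), (5, 0.90), (17, 0.70)]:
    low, high = frequency_range_percent(count, 1681, confidence)
    print(f"{count} samples, {confidence:.0%} range: {low:.2f}% to {high:.2f}%")
```

            For 5 and 17 samples this should come out close to the ranges quoted above, which suggests, but does not prove, that the macro works in a similar way.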

            Some of the short white rows (cells not colored, no text with percent) in that table are based on fewer than 10 samples, with wide confidence range.  The taller white cells have better confidence, based on a paragroup combination of small haplogroups.

            The uncertainty for haplogroups at less than 1% is the main reason I restricted the Haplogroup Summary Table to haplogroups with at least 1%.  However, there is a complication even for some haplogroups at greater than 1%, due to proportional distribution.  As an example, consider the haplogroup YP3927, listed at 1.3%, near the center right of the table above.  This branch is a small twig in the branch sequence R1a-M459-M458-L260-YP254-Y2905-YP3927.  1.3% normally represents 22 samples in the full database (1.3% x 1681).  However, R1a-YP3927 has only 5 samples, so the confidence range is wide - the same as the example above, haplogroup C, also with 5 samples.  The reason:  Most L260 samples have not been tested for all the recently discovered SNP branches of L260.  The L260 samples that have been tested are used as a database to determine the proportional distribution of the L260 percent, assuming future SNP test results will be proportional to recent test results.  The details are in the sheet “L260” of the file ResultsTable.xls;  YP3927 is the last line of that sheet.
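
            The proportional distribution described above can be sketched as follows.  This is an illustration only, not the calculation in sheet “L260”:  the function name is hypothetical, and the count of 45 for the other tested L260 samples is a made-up number chosen so that the 5 YP3927 samples come out near the 1.3% shown in the table.  Only the 12.9% for L260, the 5 YP3927 samples, and the 1,681 sample database come from this page.

```python
def distribute_proportionally(parent_percent, tested_counts):
    """Split a parent percent among branches in proportion to SNP-tested counts."""
    total_tested = sum(tested_counts.values())
    return {
        branch: parent_percent * count / total_tested
        for branch, count in tested_counts.items()
    }


# Hypothetical test counts; only the 5 for YP3927 is taken from the text.
tested = {"YP3927": 5, "other tested L260 samples": 45}

shares = distribute_proportionally(12.9, tested)
print(f"YP3927 share: {shares['YP3927']:.1f}%")

# The number of samples that 1.3% would normally represent in the full database.
print("1.3% of 1681 samples is about", round(0.013 * 1681))
```

            The confidence range of such a share still depends on the 5 tested samples, not on the larger number that the percent would normally represent.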

            This YP3927 is an example of a result where the accuracy is dominated by the statistical uncertainty of a small number of samples.  The non-random systematic uncertainties discussed in topics above, due to what is meant by “ethnic Poland”, and due to other issues mentioned in this web page, are probably larger than the statistical uncertainties for most clades.

            I have not calculated and tabulated all confidence ranges here:  I do that in my table with Poland Concentration Index for the haplogroups most concentrated in Poland.  That “PCI” table needs to be updated.

            STR Assignment Bias:  This was a statistical bias in my Jan 2016 version.  Some haplogroups can be better predicted on the basis of STRs than others.  For this version, when dividing a haplogroup into branches, I used 12 STRs for the left side of the table (largest and oldest branches), but only for those branches where all subdivisions of a given branch are valid using only 12.  For the right side of the table (youngest and smallest branches), I did not use STRs.  For the middle of the table I used 67 STRs only for division R1a1, where subdivision into branches can be done with statistical confidence.  For details, in ResultsTable.xls, see the two sheets “R1a1 67 Category Counts” and “R1a1 67”.

            Recruitment Bias; Edits for Family Sets:  My explanation of editing for Family Sets is at the main web page. Briefly, family sets produce a recruitment bias to haplogroup frequencies.  This is a non-random statistical issue that can be reasonably corrected by editing.  A small unknown accuracy issue remains insofar as the edit for family sets is not perfect, as explained at the main web page.

 

Revision History

2011 Sep 10 Initiate this document by moving table from PolishCladesl.html

2011 Four updates after Sep 10

2012 Feb 8 update the first 3 columns of the Summary Table

2012 Feb 11 finish update of the Summary Table

2012 Feb 12 update the detailed Results Table, except R section is not  updated yet

2012 Feb 13 new Table of Largest Clades in Results

2012 Feb 14 Results update finished

2012 Feb 15 multiple edits

2012 Mar 8 Kx correction

2012 Mar 8 new columns for analysis file links;  some links added;  not all yet

2012 Mar 9 finish analysis files links

2013 Nov 6 Update of the Summary Table (full table not updated yet)

2013 Nov 8 “Largest Categories” new topic

2013 Nov 17 “Largest Terminal Haplogroups and Paragroups” new topic

2013 Nov 18 Update of Summary Table and Largest Categories

2013 Nov 19 Update of Summary Table and Largest Categories

2013 Nov 20 Update of Results Table - the long one

2013 Nov 21 Update; minor edits

2015 Mar 20 new web page “ResultsNew.html” with complete update

2015 Mar 22 a few more subdivision cells added to the table.

2015 Mar 23 more subdivision - finished table;  “Details of Analysis File” finished

2015 Mar 24 more discussion added below the table

2015 Mar 25 more discussion added below the table

2015 Mar 26 discussion topics finished;  this new web page is complete for now

2015 Mar 28 Minor edits throughout this page

2016 Feb 24 Complete Update

2016 Feb 25 Minor edits of table

2016 Feb 27 Minor edit

2016 Apr 1 Edit CTS11962 section;  correct statistics for YP515 (smaller)

2016 Apr 4 Edit YP351, included in the table

2016 Apr 14 Minor edit

2016 Apr 18 Another minor edit, error caught by Milewski

2016 Apr 27 Edit Z93 A type

2016 Jun 30 Correction B type from 3.1% to 3.3%

2016 Jul 5 Correction D type, Y2608 from 1.5% to 1.8%

2016 Jul 9 bookmark “Public Release of Data” with discussion;  database no longer displayed in the xls file

2016 Sep 18 Update of the Results Table with July 6 database;  edit of multiple topics

2016 Oct 18 Add L1237

2016 Oct 19 Rename I1a-P type to I1-Vistula, the name being used by Kozietulski.