# Deduplicate

**Deduplicate** removes rows that contain duplicate values in the selected columns of an input table. For each set of duplicates, one row is kept and the other rows are removed in full, not just the duplicated values.

| Selection           | Description                                                                                                                                                                                                                                                                                                                                                                                                                                               |
| ------------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| **Fields**          | Select the field(s) in your table to check for duplicates. Choose a single field, or choose multiple fields to remove duplicate combinations of values across those fields. |
| **Keep** (optional) | <p>Choose one of two options:</p><ul><li><p>First <em>(default)</em> </p><ul><li>When duplicates are found in your table, this option keeps the <strong>first</strong> row found in each set of duplicates</li></ul></li><li><p>Last</p><ul><li>When duplicates are found in your table, this option keeps the <strong>last</strong> row found in each set of duplicates</li></ul></li></ul> |

### Configuration

Deduplicate removes duplicate values across one or more fields in the connected table. The tool is set to auto-run: as soon as it is connected to an input table, it runs and removes duplicates across all fields in the table (that is, any fully identical rows). To configure it, open the tool and make your *Fields* and *Keep* selections.

<figure><img src="https://2577551913-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-MW_FvcY52Jcyt8JHFGs%2Fuploads%2FRx73EsrP7O0CEHpQqXG3%2Fdedupe_gif2.gif?alt=media&#x26;token=8b37a0a0-50c9-4579-a694-24131a6000c5" alt=""><figcaption></figcaption></figure>

As seen above, the *Fields* prompt lets you choose from a dropdown list of all of the columns in your table. Select one or more columns to deduplicate on.
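
For readers who think in code, the effect of the *Fields* selection can be approximated with a short Python sketch. This is not the tool's implementation: the `deduplicate` function, the list-of-dicts table representation, and the sample rows are all assumptions made for illustration.

```python
def deduplicate(rows, fields=None):
    """Keep the first row for each distinct combination of values in `fields`.

    `rows` is a list of dicts (one dict per table row). If `fields` is None,
    every column is compared, so only fully identical rows count as duplicates.
    """
    seen = set()
    kept = []
    for row in rows:
        columns = fields if fields is not None else sorted(row)
        key = tuple(row[c] for c in columns)  # the combination being compared
        if key not in seen:
            seen.add(key)
            kept.append(row)
    return kept


# Hypothetical sample data, not the table shown on this page.
rows = [
    {"playerID": "a01", "team": "NYA"},
    {"playerID": "a01", "team": "BOS"},
    {"playerID": "a01", "team": "NYA"},
    {"playerID": "b02", "team": "BOS"},
]

by_player = deduplicate(rows, fields=["playerID"])               # one field
by_player_team = deduplicate(rows, fields=["playerID", "team"])  # combination
all_fields = deduplicate(rows)                                   # full rows
```

Selecting only `playerID` leaves one row per player, while selecting `playerID` and `team` together removes only rows that repeat the same pair of values.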

While configuring the Deduplicate tool, you will see two row counts between the *Fields* and *Keep* options. They update dynamically as you change your selections.

<figure><img src="https://2577551913-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-MW_FvcY52Jcyt8JHFGs%2Fuploads%2Fop1XcOlP5e8YCo1ASDA0%2FScreen%20Shot%202022-09-22%20at%201.28.43%20PM.png?alt=media&#x26;token=21e05fcf-ad8a-42e5-a905-71dfb9ed4248" alt=""><figcaption><p>Row counts represent the table size before and after removing duplicates</p></figcaption></figure>

Finally, make your *Keep* selection: decide whether to keep the first or the last row in each set of duplicates.
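
The difference between the two *Keep* options can be sketched in Python as follows. The function name, the `keep` parameter, and the sample rows are hypothetical, and the sketch assumes the surviving rows stay in their original order, which this page does not state.

```python
def deduplicate(rows, fields, keep="first"):
    """Return one row per distinct combination of `fields` (illustrative only)."""
    if keep not in ("first", "last"):
        raise ValueError("keep must be 'first' or 'last'")
    chosen = {}  # maps each key to the index of the row to keep
    for index, row in enumerate(rows):
        key = tuple(row[f] for f in fields)
        # "first": record a key only once; "last": overwrite on every repeat.
        if keep == "last" or key not in chosen:
            chosen[key] = index
    # Assumption: surviving rows are returned in their original order.
    return [rows[i] for i in sorted(chosen.values())]


# Hypothetical sample data.
rows = [
    {"playerID": "a01", "source": "feed_1"},
    {"playerID": "b02", "source": "feed_1"},
    {"playerID": "a01", "source": "feed_2"},
]

keep_first = deduplicate(rows, ["playerID"], keep="first")
keep_last = deduplicate(rows, ["playerID"], keep="last")
```

With "first", player `a01` keeps the `feed_1` row; with "last", the same player keeps the `feed_2` row.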

### Example

Let's say we have a dataset of baseball players that needs some cleanup. The dataset combines data from various sources and, as a result, contains multiple duplicate values. To better show how the tool works, we've color-coded the duplicate values.

{% embed url="<https://datawrapper.dwcdn.net/6N2lX/2/>" %}

After connecting the above table to the Deduplicate tool, we'll configure it by selecting `playerID` in the *Fields* prompt and "First" in the *Keep* prompt.

<div align="left"><figure><img src="https://2577551913-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-MW_FvcY52Jcyt8JHFGs%2Fuploads%2FpIUuDeof4CSgIGx1ga2P%2FScreen%20Shot%202022-09-22%20at%202.43.09%20PM.png?alt=media&#x26;token=e55a92b3-6c03-4fda-a347-ca32965932db" alt=""><figcaption></figcaption></figure></div>

<div align="left"><figure><img src="https://2577551913-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-MW_FvcY52Jcyt8JHFGs%2Fuploads%2Fs1GUePvl5VX2ah7c2DnS%2FScreen%20Shot%202022-09-22%20at%202.42.56%20PM.png?alt=media&#x26;token=5613f4e0-d837-4cbf-af0b-1e65c5a6e612" alt=""><figcaption></figcaption></figure></div>

As a result, we get the table below as our output. As the colors and data of the remaining rows show, the Deduplicate tool kept the first row for each duplicated `playerID` value and removed the other rows with that value.

{% embed url="<https://datawrapper.dwcdn.net/c0zP9/1/>" %}

### Outputs

The Deduplicate tool outputs two tables: one with the duplicate rows removed, and one containing the duplicate rows.
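
As a rough illustration of the two outputs, the Python sketch below splits a table into a deduplicated table and a table of duplicate rows. It assumes the second output holds only the rows that were dropped; this page does not specify whether the kept row of each set also appears there. The `split_duplicates` name and the sample rows are hypothetical.

```python
def split_duplicates(rows, fields, keep="first"):
    """Return (deduplicated_rows, removed_rows) for a list-of-dicts table."""
    ordered = list(enumerate(rows))
    if keep == "last":
        ordered.reverse()  # scanning backwards makes the last row win
    seen = set()
    kept_indexes = set()
    for index, row in ordered:
        key = tuple(row[f] for f in fields)
        if key not in seen:
            seen.add(key)
            kept_indexes.add(index)
    deduplicated = [r for i, r in enumerate(rows) if i in kept_indexes]
    # Assumption: the second table contains only the dropped rows.
    removed = [r for i, r in enumerate(rows) if i not in kept_indexes]
    return deduplicated, removed


# Hypothetical sample data.
rows = [
    {"playerID": "a01", "note": "w"},
    {"playerID": "b02", "note": "x"},
    {"playerID": "a01", "note": "y"},
    {"playerID": "a01", "note": "z"},
]

deduplicated, removed = split_duplicates(rows, ["playerID"])
rows_before = len(rows)
rows_after = len(deduplicated)
```

Under this assumption, the two outputs together always account for every input row, so `rows_before` equals `rows_after` plus the number of removed rows.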
